DOCUMENT RESUME 



ED 235 206 TM 830 620

AUTHOR Jarjoura, David 

TITLE Confidence and Tolerance Intervals for True Scores.
ACT Technical Bulletin Number 42.

INSTITUTION American College Testing Program, Iowa City, Iowa.
Research and Development Div.

PUB DATE Jul 83

NOTE 63p.

AVAILABLE FROM Research and Development Division, The American
College Testing Program, P.O. Box 168, Iowa City, IA
52243.

PUB TYPE Reports - Research/Technical (143) 



EDRS PRICE MF01/PC03 Plus Postage.

DESCRIPTORS *Educational Testing; Measurement Techniques;
*Models; *Scores; Statistical Analysis; Test
Interpretation

IDENTIFIERS *Confidence Intervals (Statistics); *Tolerance
Intervals (Statistics)



ABSTRACT 

Issues regarding confidence and tolerance intervals
are discussed within the context of educational measurement.
Conceptual distinctions are drawn between these two types of
intervals; and examples, under various error and true score models,
are used to compare such intervals. It is shown that there tend to be
only small differences in tolerance intervals under different true
score models. It is also demonstrated that confidence and tolerance
intervals are not only quite distinct conceptually, but also can be
very different numerically. Points are raised about the usefulness of
tolerance intervals when the focus is on a particular observed score
rather than a particular examinee. (Author)



***********************************************************************
* Reproductions supplied by EDRS are the best that can be made *
* from the original document. *
***********************************************************************




David Jarjoura

The American College Testing Program 
July 1983 




A principal purpose of the ACT Technical Bulletin Series is to
provide timely reports of the results of measurement research
at ACT. Comments concerning technical bulletins are solicited
from ACT staff, ACT's clients, and the professional community
at large. A technical bulletin should not be quoted without
permission of the author(s). Each technical bulletin is automatically
superseded upon formal publication of its contents.



Research and Development Division 

The American College Testing Program 

P.O. Box 168 

Iowa City, Iowa 52243 






Table of Contents

Abstract
Introduction
Confidence Intervals for True Scores
A Weak Claim About Confidence Intervals
Coverage of True Scores Conditional on Observed Score: Tolerance Intervals
Tolerance Intervals
Two Perspectives on Tolerance Interval Estimation
Comparison of Tolerance Intervals Under Four True Score Models
Description of the Models
Calculation of Tolerance Intervals
Examples
Some General Comparisons
Summary and Conclusions for Tolerance Comparisons
Comparison of Confidence Intervals
Two Examples
Comments on the Binomial Error Model
Comparison of Confidence and Tolerance Intervals
Discussion
Checking Assumptions
Bayesian Credibility Intervals
Conclusions
References






List of Tables

Table
1 Tolerance Intervals for n = 35, E X/n = .5, $\sigma^2$ = .0423, and $S^2$ = .027
2 Tolerance Intervals for n = 35, E X/n = .75, $\sigma^2$ = .0227, and $S^2$ = .013
3 Tolerance Intervals for n = 25, E X/n = .5, $\sigma^2$ = .0444, and $S^2$ = .027
4 Tolerance Intervals for n = 100, E X/n = .75, $\sigma^2$ = .0119, and $S^2$ = .020
5 Mean Absolute Differences of Tolerance Limits for Seven Test Characteristics
6 Mean Differences of Upper Limits for Observed Scores Below the Mean
7 Mean Widths of Intervals
8 Confidence Intervals for n = 35
9 Confidence Intervals for n = 100






Abstract 

Issues regarding confidence and tolerance intervals are discussed within the
context of educational measurement. Conceptual distinctions are drawn between
these two types of intervals; and examples, under various error and true score
models, are used to compare such intervals. It is shown that there tend to
be only small differences in tolerance intervals under different true score
models. It is also demonstrated that confidence and tolerance intervals are
not only quite distinct conceptually, but also can be very different numerically.
Points are raised about the usefulness of tolerance intervals when the focus
is on a particular observed score rather than a particular examinee.



Introduction

Through the use of confidence intervals for true scores, one can discourage
interpretations of observed test scores that are too literal. Such an interval
also provides a gauge for the potential error associated with a measurement
procedure. This paper discusses confidence intervals within
the context of educational measurement, and contrasts them, conceptually and
through numerical examples, against tolerance intervals. A major portion of the
paper compares tolerance intervals that are based on various true score models.

Some fundamental issues regarding true score confidence intervals are discussed
here so that distinctions can be drawn between various interpretations of
these intervals, and so that clear contrasts can be made with true score tolerance
intervals. Tolerance intervals, as such, have not been previously suggested for
true scores, although intervals with the same or similar form have appeared in
both the early and recent literature. For example, intervals around the familiar
regressed score estimates can be viewed as tolerance intervals under certain
assumptions. Also, true score tolerance intervals can resemble Bayesian credibility
intervals; but because true score tolerance intervals fall within the
framework of the classical regression model, the two approaches are quite distinct
conceptually.

Generally, confidence interval procedures are designed to cover, with a
chosen probability, the value of a parameter. It is often emphasized that a realized
interval, i.e., one that is based on a particular set of observations or
realized sample, either does or does not cover the value of a parameter, and the
interpretation of a realized confidence interval must be in terms of the procedure
on which it is based. An interpretation that is often suggested is that a confidence
interval procedure will, over repeated applications, cover a parameter a
chosen proportion of the time.






In a measurement context, a realized confidence interval for a particular
examinee is often based on the observed score obtained by that examinee and a
standard error of measurement that is estimated from a large sample of examinees.
Typically, no more than one observed test score is available for a particular
examinee, but we can interpret a confidence interval procedure for that examinee
in terms of his/her hypothetical distribution of observable scores. The mean or
expected value of this distribution is the parameter of interest; i.e., his/her
true score is the parameter to be covered by a confidence interval procedure.

The assumption that the standard error of measurement (the standard deviation
of the hypothetical distribution) is the same for all examinees justifies the use
of a single estimate of this standard error for constructing confidence intervals
across examinees. But a weaker claim could be made about the overall confidence
interval procedure which does not depend on this assumption. Instead of claiming
that a confidence interval procedure covers a particular examinee's true score
with a chosen probability, it might be claimed that "on average" such a procedure
covers the true scores of a population of examinees a chosen proportion of the
time. This average probability is taken over the examinee population and allows
for the possibility that a confidence interval procedure for a particular examinee
does not have a coverage probability equal to the average probability across examinees.
The average coverage claim is explored in this paper in order to determine the
conditions that make it accurate.

The issue of average coverage of a confidence interval procedure raises other
issues regarding interval estimation of true scores. In a measurement situation
in which potentially many intervals are reported, it seems natural to describe
the statistical properties of the overall procedure of setting intervals for some
population of examinees, rather than restrict attention to the properties for an
isolated examinee. Consider the typical situation in which all examinees with
the same observed score receive the same interval. What seems of special interest
is the probability of coverage of true scores for an interval based on a particular
observed score. More precisely, we can ask: What is the proportion of the true
score distribution, conditional on a particular observed score, that is covered by
an interval based on that observed score? This is to be distinguished from the
interpretation of a realized confidence interval based on a particular observed
score, which must be in terms of the confidence interval procedure rather than
the realized interval. In a measurement context, there is a distribution of true
scores associated with a population of examinees. For this reason, we can interpret
an interval based on a particular observed score in terms of the conditional
(on that score) distribution of true scores rather than in terms of a particular
examinee's true score. Thus, we can design an interval to cover some proportion
of the conditional true score distribution.

An interval designed to cover some proportion of a distribution is usually
referred to as a tolerance interval. Such intervals are the major focus here.
Because tolerance intervals for conditional true score distributions require a
"strong" true score model (an explicit specification of the joint distribution
of observed and true scores), four such models are used for comparing the intervals
they produce. The comparisons, which comprise a major portion of the paper,
are based on a variety of test characteristics adapted from standardized tests.
Similarly, confidence intervals from three error models are compared, and are
then contrasted with tolerance intervals.






Confidence Intervals for True Scores

Considered in isolation, the process of making an inference about a particular
examinee's true score suggests that a confidence statement can be a useful
part of the process. When we focus on examinee a, we are interested
in a parameter, namely the true score $\tau_a$ of that examinee. With $\tau_a$ defined as the mean
of observable scores, $X_a$, for that examinee, it seems natural to attempt to acquire
information about the distribution of $X_a$. And a confidence interval procedure
seems to be a succinct method for expressing such information. For example, the
confidence statement $P[L(X_a) \le \tau_a \le U(X_a)] = 1 - \alpha$, where L and U are variables
dependent on the random variable $X_a$ (and possibly other random variables) and $1 - \alpha$
is the confidence coefficient or the probability that potential intervals cover
$\tau_a$, can provide information about the distribution of $X_a$ and the accuracy with
which $X_a$ "measures" $\tau_a$.

a a 

Obtaining enough information to feel comfortable in making such statements
might require several observations on examinee a. But in a measurement context,
certain factors usually preclude such an approach. Because of the difficulty in
obtaining several observations on full test forms, and potential problems with
practice, fatigue, motivation, etc., more than two observations on any examinee
are rarely obtained. Instead, properties of the overall measurement procedure,
based on a population of examinees, are used to estimate confidence intervals for
each examinee. Strong assumptions could be used to justify a confidence interval
procedure for a particular examinee. For example, the error variable for examinee a,
$e_a = X_a - \tau_a$, could be assumed normal with the same variance, $\sigma_e^2$, for all examinees.
An accurate estimate of $\sigma_e^2$ could then be obtained, say, through the administration
of two parallel forms to a large sample of examinees. This would allow a confidence
statement of the form $P(X_a - c\sigma_e \le \tau_a \le X_a + c\sigma_e) = 1 - \alpha$ to be used for
examinee a. The c could be determined from a z or t table depending on the sample
size for $\hat{\sigma}_e$.
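The parallel-forms approach just described can be sketched in a few lines of Python. This is only an illustration under the equal-error-variance assumption, in which the variance of the difference between two parallel-form scores is $2\sigma_e^2$. The function names and the six pairs of scores are hypothetical, and a real application would use a large sample.

```python
import statistics

def sem_from_parallel_forms(form1, form2):
    """Estimate sigma_e from paired scores on two parallel forms.

    Assumes equal error variance for all examinees, so that
    Var(X1 - X2) = 2 * sigma_e**2.
    """
    diffs = [a - b for a, b in zip(form1, form2)]
    return (statistics.variance(diffs) / 2) ** 0.5

def confidence_interval(x, sem, c=1.96):
    """Interval of the form x - c*sigma_e to x + c*sigma_e."""
    return (x - c * sem, x + c * sem)

# Hypothetical scores for six examinees (illustration only).
form1 = [20, 25, 31, 18, 27, 22]
form2 = [22, 24, 29, 20, 28, 21]

sem = sem_from_parallel_forms(form1, form2)
low, high = confidence_interval(25, sem)
print(round(sem, 3), round(low, 2), round(high, 2))
```

With so few examinees a t-table value for c would be more appropriate than 1.96; the z value is used here only to keep the sketch short.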






Still stronger assumptions might be used so that even estimation of $\sigma_e$ is
avoided. For example, with number correct scoring, the binomial error model
(Lord & Novick, 1968, chaps. 11 & 23) is sometimes viewed as appropriate. If
this model holds for an examinee, we can simply use the examinee's observed score
and the number of items in a test to enter a table of confidence intervals for a
binomial parameter. This would provide a confidence interval for the examinee's
proportion correct true score.
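In place of a printed table, the limits for a binomial parameter can be computed directly. The sketch below assumes the exact (Clopper-Pearson) construction, which is one common basis for such tables; the text does not say which table is intended, so this choice and the function names are assumptions.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k + 1))

def _solve_decreasing(f, target, lo=0.0, hi=1.0, iters=60):
    """Bisection: find p in [lo, hi] with f(p) = target, f decreasing in p."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided interval for a proportion-correct true score.

    x is the number-correct observed score, n the number of items.
    """
    # Lower limit p solves P(X >= x | p) = alpha/2.
    lower = 0.0 if x == 0 else _solve_decreasing(
        lambda p: binom_cdf(x - 1, n, p), 1 - alpha / 2)
    # Upper limit p solves P(X <= x | p) = alpha/2.
    upper = 1.0 if x == n else _solve_decreasing(
        lambda p: binom_cdf(x, n, p), alpha / 2)
    return lower, upper

lower, upper = clopper_pearson(28, 35)
print(round(lower, 3), round(upper, 3))
```

Only the observed score and the number of items enter the calculation, which is the sense in which estimation of $\sigma_e$ is avoided.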
A Weak Claim About Confidence Intervals

Such strong assumptions allow the strong claim that is made about a confidence
interval procedure for a particular examinee. However, if such assumptions
are unwarranted, one could still make a weaker claim about the confidence interval
procedure. For example, it might be claimed that, on average, intervals of the
general form $X_a \pm z_{\alpha/2}\bar{\sigma}_e$ cover examinees' true scores with probability $1 - \alpha$, where
the average is taken over the population of examinees for which such intervals are
reported, and where $z_{\alpha/2}$ refers to the $1 - \alpha/2$ cumulative percentage point of the
standard normal distribution. In this case, the confidence interval procedure for
a particular examinee can be associated with a coverage probability that is greater
or less than $1 - \alpha$, but the average across examinees is $1 - \alpha$.
This average coverage is expressed as

$$E\left[P\left(X_a - z_{\alpha/2}\bar{\sigma}_e \le \tau_a \le X_a + z_{\alpha/2}\bar{\sigma}_e\right)\right] = 1 - \alpha , \qquad (1)$$

where E is the expectation operator over examinees, the probability statement is
the coverage probability for examinee a of $X_a \pm z_{\alpha/2}\bar{\sigma}_e$, and $\bar{\sigma}_e^2$ is the average
measurement error variance for the population of examinees. One way of writing
this in integral form is

$$\frac{1}{N}\sum_{a=1}^{N}\int_{-z_{\alpha/2}\bar{\sigma}_e}^{z_{\alpha/2}\bar{\sigma}_e} dF_a(e_a) = 1 - \alpha , \qquad (2)$$

where $F_a(e_a)$ is the cumulative distribution function of $e_a$ and the integral is in
Stieltjes form to allow for discrete $e_a$ (both end points are included in the
integration). The summation is over the N examinees in the population for which the
average coverage claim is made. We can switch the order of integration in Equation
1, so that

$$\int_{-z_{\alpha/2}\bar{\sigma}_e}^{z_{\alpha/2}\bar{\sigma}_e} d\left[\frac{1}{N}\sum_{a=1}^{N}F_a(e)\right] = 1 - \alpha . \qquad (3)$$

Now the limits $\pm z_{\alpha/2}\bar{\sigma}_e$ can be viewed as points on a mixture of the distributions
of the error variables, or the marginal distribution of error. Thus, the average
coverage claim simply states that the area in the marginal distribution between
the two points $\pm z_{\alpha/2}\bar{\sigma}_e$ is $1 - \alpha$.

As defined, the mean of $e_a$ for each examinee is zero, so these two points
are equidistant from the mean. Of course, in order for this average coverage
claim to hold at every value of $z_{\alpha/2}$, normality of the marginal error distribution
is necessary. However, if we only make the following claim, "$X_a \pm 1.96\bar{\sigma}_e$ has an
average coverage of .95, approximately," many distribution shapes will do.
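A small numerical illustration of the average coverage claim follows. It assumes, purely for illustration, that each examinee's error is normal but that the error standard deviations differ across examinees; the four standard deviations and the function name are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def average_coverage(error_sds, z=1.96):
    """Coverage of X_a +/- z * sigma_bar_e for each examinee and on average.

    sigma_bar_e**2 is the average of the individual error variances.
    Each examinee's error is assumed normal with mean zero.
    """
    std = NormalDist()
    mean_var = sum(s**2 for s in error_sds) / len(error_sds)
    half_width = z * sqrt(mean_var)
    per_examinee = [2 * std.cdf(half_width / s) - 1 for s in error_sds]
    return per_examinee, sum(per_examinee) / len(per_examinee)

# Hypothetical error standard deviations for four examinees.
per_examinee, avg = average_coverage([1.5, 2.0, 2.5, 3.0])
print([round(p, 3) for p in per_examinee], round(avg, 3))
```

The individual coverage probabilities fall on both sides of .95, while their average stays close to .95 even though the marginal error distribution is a mixture of normals rather than a normal.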

If we just assume that the marginal distribution of error has one mode, we
can use the Camp-Meidell inequality (Rao, 1973, p. 145) which states that

$$P(|X - \mu| > \lambda\sigma) \le \frac{4(1 + s^2)}{9(\lambda - s)^2} ,$$

where $\mu$ is the mean, $\sigma$ is the standard deviation, s is the absolute value of the
number of standard deviation units that $\mu$ is from the mode, and $\lambda > s$. Say, for
example, s = .2; then the average coverage of $X_a \pm 1.96\bar{\sigma}_e$ is greater than .85 for any
unimodal distribution. Thus, with an accurate estimate of $\bar{\sigma}_e^2$ (the average of
individual error variances) and $z_{\alpha/2}$ around 1.96, one might feel comfortable in
making an average coverage claim around .9.
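The bound is easy to evaluate for particular values. The sketch below simply codes the inequality as stated above; the function name and the choice of s values to print are illustrative assumptions.

```python
def camp_meidell_tail_bound(lam, s):
    """Upper bound on P(|X - mu| > lam * sigma) for a unimodal distribution.

    s is the absolute distance, in standard deviation units, between
    the mean and the mode; the bound requires lam > s.
    """
    if lam <= s:
        raise ValueError("the bound requires lam > s")
    return 4 * (1 + s**2) / (9 * (lam - s) ** 2)

for s in (0.0, 0.1, 0.2):
    lower_bound_on_coverage = 1 - camp_meidell_tail_bound(1.96, s)
    print(s, round(lower_bound_on_coverage, 3))
```

For a symmetric unimodal distribution (s = 0) the guaranteed coverage at 1.96 standard deviations is about .88, and it declines as the mean moves away from the mode.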

Clearly, an average coverage claim is fairly weak. It describes a property of
the overall interval estimation procedure in a measurement context, but does
little to describe the limitation of the information from $X_a$ about examinee a's
true score. Nor does it make any claims about what to expect at different score
points. If evidence is available indicating that error variance differs along the
score scale, then an average coverage claim seems especially uninformative. However,
consideration of such differences when constructing intervals could allow a
claim of average coverage for different ranges of observed scores.

Coverage of True Scores Conditional on Observed Score: Tolerance Intervals

That an average coverage claim over different parameters (i.e., true scores) is
sensible in a measurement context raises the question of whether we are always
interested in an interval estimate for a particular examinee. Under circumstances
in which a particular examinee's score is being interpreted by a career counselor
or classroom teacher, a proper confidence interval seems quite useful in combination
with other information about that examinee. In contrast, the process
of score reporting, in which large numbers of examinees are given the same score
and interval, is not intimately concerned with a particular examinee. Rather,
there is a distribution of true scores that is referenced by a particular observed
score. Thus, in a measurement context we can interpret an observed
score in terms of the conditional distribution of true scores associated with it.






In a typical situation in which the group of examinees with the same observed
score receives the same interval estimate, it seems natural to inquire
about the proportion of the distribution of true scores given an observed score that
is covered by the interval. In other words, for an observed score x (realized
value of the X variable that represents observable scores for the population of
examinees), there is a proportion of the conditional distribution of true scores
that is covered by an interval like $x \pm c\sigma_e$. This proportion, which is a
conditional (on x) probability, is conceptually distinct from a confidence
coefficient. If we condition on an observed score, then a confidence interval does or
does not cover a particular examinee's true score; i.e., the conditional probability
is not in reference to any particular examinee. Later, intervals of the
form $x \pm c\sigma_e$ are evaluated in terms of the conditional distribution of true scores.

e 

Tolerance Intervals 

Probability statements about conditional true score distributions require
strong assumptions or data that are usually not available. In particular, the
joint distribution of observed and true scores is needed. For expository purposes,
we assume that the distribution of error conditional on true score is normal with
mean zero and a variance that is constant across true scores. In addition, true
score is assumed normal. Thus, X is the sum of two independent and normal variables
$\tau$ and e, where $\tau$ is normal($\mu$, $\sigma_\tau^2$), and e is normal(0, $\sigma_e^2$).

Under this model it is well-known that the conditional distribution of $\tau$
given X = x is normal($\rho^2 x + (1 - \rho^2)\mu$, $\rho^2\sigma_e^2$), where $\rho^2 = \mathrm{CORR}(X, \tau)^2 = 1 - \sigma_e^2/\sigma_X^2$.
We will refer to the mean of the conditional distribution as

$$\hat{\tau}(x) = \rho^2 x + (1 - \rho^2)\mu , \qquad (4)$$

which is the familiar regressed score estimate of true score (see, e.g., Lord &
Novick, 1968, pp. 64-69).


Since the conditional distribution of $\tau$ given x is normal,

$$\hat{\tau}(x) \pm z_{\alpha/2}\,\rho\,\sigma_e \qquad (5)$$

is an interval which covers the central $100(1 - \alpha)\%$ of the conditional true score
distribution associated with x (this holds for all values of X). It is referred
to as central because each tail of the conditional distribution not covered by
the interval contains $100(\alpha/2)\%$ of the true scores. A central interval, in the
case of the normal, is also the shortest interval that covers $100(1 - \alpha)\%$ of
the conditional distribution.
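The interval of Equation 5 can be computed as in the following sketch, which assumes the normal error and normal true score model with known parameters. The function name and the numerical values (mean 50, observed score variance 100, error variance 20) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def central_tolerance_interval(x, mu, var_x, var_e, alpha=0.05):
    """Central 100(1 - alpha)% tolerance interval for true scores given X = x.

    Assumes normal true scores and normal errors with known mean mu,
    observed score variance var_x, and error variance var_e.
    """
    rho2 = 1 - var_e / var_x                  # reliability
    center = rho2 * x + (1 - rho2) * mu       # regressed score estimate
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(rho2 * var_e)       # z * rho * sigma_e
    return center - half_width, center + half_width

low, high = central_tolerance_interval(60, mu=50, var_x=100, var_e=20)
print(round(low, 2), round(high, 2))
```

Note that the interval is centered on the regressed score (58 in this example) rather than on the observed score of 60, and that its half-width uses $\rho\sigma_e$ rather than $\sigma_e$.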

Such intervals are quite distinct from confidence intervals. As noted, a
confidence interval procedure is designed to cover, with a chosen probability,
some parameter of a distribution. In contrast, the above interval is designed
to cover a chosen proportion of the distribution of a random variable. Intervals
of this type are referred to as tolerance intervals. Proschan (1953) provides
some basic comparisons between tolerance and confidence intervals.

Before discussing some issues regarding the estimation of tolerance intervals,
some comparisons will be made between confidence and tolerance intervals in the
context of measurement. Stanley (1971, pp. 379-382), among others, discusses an
interval similar to that of Equation 5 and appropriately refers to it as a "confidence
interval in a loose sense."¹ Also, a few of the following points have been
touched on in the measurement literature, though from a different perspective.



¹ The difference between Equation 5 and his Equation 19 is his use of the t
distribution instead of the z and his use of estimates of the parameters $\mu$, $\rho^2$,
and $\sigma_e$. Estimation issues for tolerance intervals are complex and have not been
solved for the case where the variable of interest is unobservable. Thus, use of
a t distribution just provides wider intervals than the z.






The justification given above for using a confidence interval in a measurement
context is that a score interpretation situation calls for isolated interest
in a particular examinee's true score. A mistaken interpretation of a reported
or realized confidence interval based on a given score might then be that it provides
a range of probable values for the true score of that examinee. Instead
of considering a reported confidence interval as an indication of the accuracy
with which an examinee's true score is estimated, its meaning is distorted to
include consideration of the likely values of true scores in the population of
examinees for a given score x.² Within the context of classical confidence interval
estimation, such an interpretation makes little or no sense because, again,
the value of a single parameter is of interest. Within a measurement context,
however, there is a distribution of true scores of interest, so that such an
interpretation may be desirable. But confidence intervals are not designed to provide
such interpretations, and they would lead, at the least, to inaccuracies.

Consider again the model with normal and independent error and true scores
(and no two examinees have the same true score). A confidence interval
of the general form $X_a \pm z_{\alpha/2}\sigma_e$ would, for every examinee, have a confidence
coefficient of $1 - \alpha$; i.e., the probability that an interval of this form covers
an examinee's true score is $1 - \alpha$. In contrast, a reported confidence interval
based on a realized value of $X_a$, $x \pm z_{\alpha/2}\sigma_e$, either covers an examinee's true
score or does not. Now, by considering the population of examinees, this same
reported interval will, typically, cover more or less than $1 - \alpha$ of the true
scores that can be associated with x. The interval $x \pm z_{\alpha/2}\sigma_e$ covers somewhat
more than $1 - \alpha$ of the true scores associated with x, when x is close to $\mu$,
and less when it is far away. It is easily shown that the average proportion
covered, taken across the variable X, is in fact $1 - \alpha$.

² It is true that Bayesian approaches allow isolated interest in an
examinee's true score and a statement about probable values of that true score.
However, the nature of probability changes, and, in any case, a classical confidence
interval procedure would not be used, typically.

In order to determine, under this model, the proportion of the conditional
true score distribution covered by a realized interval of the form $x \pm z_{\alpha/2}\sigma_e$,
we need only specify the reliability ($\rho^2$) and the number of $\sigma_X$ units x is from
$\mu$. Let us take $\rho^2$ = .8, and for simplification $z_{\alpha/2}$ = 1 (i.e., $1 - \alpha$ = .68).
When x = $\mu$, the realized interval $x \pm \sigma_e$ covers the central 74% of the distribution
of true scores associated with x. When x = $\mu + \sigma_X$ or x = $\mu - \sigma_X$, $x \pm \sigma_e$
covers 68% of the conditional true score distribution, but not the central 68%;
i.e., the areas in the tails of the distribution not covered by the interval are
unequal. When x = $\mu + 2\sigma_X$ or x = $\mu - 2\sigma_X$, $x \pm \sigma_e$ covers only 53% of the
conditional true score distribution, again not the central 53%. To see this,
consider that, under the model, the area between $x - \sigma_e$ and $x + \sigma_e$ is being
evaluated for the distribution of $\tau$ given x which has mean $\rho^2 x + (1 - \rho^2)\mu$ and
variance $\rho^2\sigma_e^2$. Thus, except when x = $\mu$, realized confidence intervals will
be centered farther from $\mu$ than the mean (center) of the conditional true score
distribution. In contrast, tolerance intervals are centered on this mean. Even
though on average the proportion covered is 68% for this example, one-third of
the confidence intervals will cover less than 68% of the conditional true score
distributions. Thus for this example, at least, realized confidence intervals
would be very misleading if interpreted as tolerance intervals.
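The percentages in this example can be checked numerically. The sketch below works in standardized units (mean 0 and observed score standard deviation 1, which loses no generality under the model); the function names and the grid used for the numerical average are illustrative assumptions.

```python
from math import sqrt
from statistics import NormalDist

def coverage_of_realized_ci(k, rho2, z=1.0):
    """Proportion of the conditional true score distribution covered by
    x +/- z*sigma_e when x lies k observed-score SDs from the mean.

    Standardized so that mu = 0 and sigma_X = 1.
    """
    sigma_e = sqrt(1 - rho2)
    cond = NormalDist(mu=rho2 * k, sigma=sqrt(rho2) * sigma_e)
    return cond.cdf(k + z * sigma_e) - cond.cdf(k - z * sigma_e)

def average_coverage_over_x(rho2, z=1.0, step=0.01, span=8.0):
    """Riemann-sum average of the coverage over X ~ normal(0, 1)."""
    std = NormalDist()
    n = int(2 * span / step)
    total = 0.0
    for i in range(n + 1):
        k = -span + i * step
        total += coverage_of_realized_ci(k, rho2, z) * std.pdf(k) * step
    return total

for k in (0, 1, 2):
    print(k, round(coverage_of_realized_ci(k, 0.8), 2))
print(round(average_coverage_over_x(0.8), 3))
```

The three printed proportions reproduce the 74%, 68%, and 53% figures, and the average over the observed score distribution returns the nominal 68%.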

Another related criticism against interpreting the interval $x \pm z_{\alpha/2}\sigma_e$
as a tolerance interval for x is that it can be considered more appropriate for
observed scores that are farther from $\mu$ (more extreme) than x. This is because
this interval covers a greater proportion of the conditional distributions of
true scores for scores more extreme than x than it does for x itself. Again,
this interval is not centered on the mean of the conditional true score distribution.
Specifically, there is an observed score x* such that
$P(x - z_{\alpha/2}\sigma_e \le \tau \le x + z_{\alpha/2}\sigma_e \mid X = x^*)$ takes on the largest value, and this x* is more distant
from $\mu$ than is x. The value of this probability is also larger for all values
between x* and x than it is for x, and the same holds for more extreme scores
between 2x* - x and x*. The value of x* is $\mu + (x - \mu)/\rho^2$. To understand why
the probability is largest under x*, consider that the conditional mean of $\tau$
given x* is x; i.e., x* makes the interval $x \pm z_{\alpha/2}\sigma_e$ centered around $\hat{\tau}(x^*)$.
And, the values between 2x* - x and x are also associated with larger probabilities
than x simply because their conditional means are closer to $\hat{\tau}(x^*)$ than is
$\hat{\tau}(x)$.
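The role of x* can be illustrated with a short calculation under the normal error and normal true score model. The function names and the numerical values (mean 50, observed score standard deviation 10, reliability .8, observed score 60) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def coverage(x_interval, x_cond, mu, sigma_x, rho2, z=1.0):
    """Proportion of the true score distribution given X = x_cond that is
    covered by the interval x_interval +/- z*sigma_e."""
    sigma_e = sigma_x * sqrt(1 - rho2)
    cond = NormalDist(rho2 * x_cond + (1 - rho2) * mu, sqrt(rho2) * sigma_e)
    return cond.cdf(x_interval + z * sigma_e) - cond.cdf(x_interval - z * sigma_e)

def most_covered_score(x, mu, rho2):
    """Observed score x* whose conditional true score mean equals x."""
    return mu + (x - mu) / rho2

mu, sigma_x, rho2, x = 50.0, 10.0, 0.8, 60.0
x_star = most_covered_score(x, mu, rho2)
print(x_star)
print(round(coverage(x, x, mu, sigma_x, rho2), 3))       # interval applied to x itself
print(round(coverage(x, x_star, mu, sigma_x, rho2), 3))  # same interval applied to x*
```

In this example the interval built around x = 60 covers more of the true score distribution associated with x* = 62.5 than of the one associated with 60 itself, and it covers the same proportion at 2x* - x = 65 as at x.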

These comparisons have been made within the context of the normal error-normal
true score model. However, similar, though perhaps not as strong, statements could
be made for other models. We can expect, for instance, that for most reasonable
true score models, x will always be further from $\mu$ than the mean of $\tau$ given x.³

Two Perspectives on Tolerance Interval Estimation

When parameters of a distribution are not known precisely, estimation of
tolerance intervals has been found to be fairly simple or quite complex depending
on, among other things, the properties required of the estimator. There are two
alternative properties that are discussed. One is that the interval estimator
cover on average the desired proportion of the distribution. The desired proportion
is then referred to as the expected coverage. For the normal univariate
case, Proschan (1953) provides such an estimator. The expected coverage requirement
of tolerance intervals for the conditional distribution of true scores
(given x) can be written as

$$E\left[\int_{L(x)}^{U(x)} f(\tau \mid x)\,d\tau\right] = 1 - \alpha ,$$

where U(x) and L(x) represent upper and lower limits of the tolerance interval
for given x. In the discussion above, $U(x) = \hat{\tau}(x) + z_{\alpha/2}\rho\sigma_e$ and $L(x) = \hat{\tau}(x) - z_{\alpha/2}\rho\sigma_e$.
But without knowledge of $\mu$, $\rho^2$, and $\sigma_e$, U(x) and L(x) are random variables that
depend on estimates of these three parameters. Thus, the expectation is over
U(x) and L(x).

³ Consider the binomial error model and a true score distribution that is
assumed uniform between 0 and 1. The mode of the conditional distribution of $\tau$
given x is then x/n (Novick & Jackson, 1974, p. 114). Since the highest density
region converges on the mode, this provides a contrast to comments above. However,
the uniform is an interesting prior but is unrealistic as an empirical distribution
for true scores. Further, central tolerance intervals converge on the median rather
than the mode.

The other alternative property places a confidence statement on the proportion
of the distribution covered by an estimator. It places a probability
on the event that a tolerance interval estimator covers at least the desired
proportion of the distribution. In terms of the conditional distribution of
true score (given x) this can be written as

$$P\left[\int_{L(x)}^{U(x)} f(\tau \mid x)\,d\tau \ge 1 - \alpha\right] = \lambda ,$$

where $\lambda$ is the confidence coefficient. When the parameters of the conditional
distribution are assumed known, $\lambda$ = 1. Otherwise the probability depends on
U(x) and L(x). As an example, U(x) and L(x), as estimators of tolerance
limits, might be chosen so that the probability is .95 that the limit estimators
cover at least 68% of the true score distribution associated with x.




Probability of coverage estimators receive more attention than expected
coverage estimators, mainly because they provide a more informative statement
about the behavior of an estimator. Some even define tolerance intervals only
in terms of probability of coverage. Also, probability of coverage is more useful
in a major application of tolerance intervals, namely quality control problems.
It does, however, create greater complexities, and typically $\lambda$ is close
to 1, which produces wider intervals than expected coverage intervals. Wald and
Wolfowitz (1946) first provided an approximation under normality for a probability
of coverage estimator. Wallis (1951) solved the estimation problem for the linear
regression model, which has some relevance to our problem. More current work
has focused on simplifying methods and extensions to simultaneous intervals for
the regression case (see, e.g., Lieberman & Miller, 1963).

Tolerance intervals are rarely discussed in statistical methods texts
(Dixon & Massey, 1962, p. 199; and Graybill, 1976, pp. 270-275, are two exceptions).
Instead, the related issue of prediction intervals is often discussed (see, e.g.,
Graybill, 1976, pp. 267-270 for prediction intervals in the linear regression model).
Such an interval is used to predict a range of probable values for some future
observation or linear function of several observations. Note that a $1 - \alpha$ prediction
interval for a single observation is the same as a tolerance interval
with expected coverage of $1 - \alpha$ (Proschan, 1953). The key to the identity is
that the distribution of a single future observation is the distribution for which
a tolerance interval is desired.

Because none of the research on the estimation of tolerance intervals considers
the case in which the variable of interest is unobservable, none of it is
directly relevant to the problem at hand. Even prediction intervals for the linear
regression model would not serve as an expected coverage interval for the normal
error and true score model because the basic assumptions are quite different under
the two models.


Comparison of Tolerance Intervals Under Four True Score Models
The focus of this section is on the comparison of tolerance intervals calculated
under different measurement model assumptions. Since tolerance interval
estimators have not been derived for these true score models, comparisons are
made under the presumption that accurate estimates of model parameters are
available. Essentially, all that is presumed is that large enough samples are
available to accurately estimate the mean and variance of the observed scores. This
is because two of the models need only these two parameters for calculating
tolerance intervals, and the other two models need only one additional parameter that
does not seem to play a substantial role in the intervals. It seems important
to focus attention on a comparison of true score models before tolerance interval
estimators are derived because of the strong and sometimes unwarranted assumptions
associated with each. The effects of differences in assumptions on differences
in tolerance intervals can facilitate not only an informed choice of a
model for calculating intervals with large samples but also a choice in the
derivation of estimators.

First, the four true score models are described. Equations for calculating 
intervals under these models are then provided. This is followed by detailed 
comparisons among tolerance intervals based on test characteristics that were 

adapted from standardized tests. 

Description of the Models

For the comparison of tolerance intervals, the following measurement models
were used: (1) the normal model discussed above, in which the conditional
distribution of observed score (given true score) is normal and the distribution
of true score is normal (NORM); (2) the conditional distribution of observed
score is binomial and the distribution of true score is beta (BETA); (3) the
conditional distribution of observed score is binomial but an angular (variance
stabilizing) transformation provides approximate normality, and yields a normal
true score distribution (BINORM); and (4) the conditional distribution of observed
score is compound-binomial but an angular transformation provides approximate
normality, and yields a normal true score distribution (CONORM).




All four models have been discussed previously in the literature and, except
for the NORM model, they were developed for number correct scoring. Lord and
Novick (1968, chap. 22) provide a discussion of the NORM model, especially normal
and independent error. Although this model was not designed especially for
number correct scoring, it is included because of its convenience, historical
popularity, relation to the BINORM and CONORM models, and as a contrast with the
other three models.

The BETA model is discussed in detail by Keats and Lord (1962) and subsequently
in work concerned with mastery testing (see, e.g., Huynh, 1976). Although
the beta-binomial combination is a matter of convenience, and the resulting
model depends on just two unknown population parameters, the fit to
number correct observed score distributions is often impressive (see, e.g.,
Keats & Lord, 1962). Wilcox (1981) reviews competitors to this model
and concludes that it frequently gives satisfactory results and that choosing a
more complex model involving additional free parameters can be quite difficult.
Robustness of a methodology based on this model has also been shown (Gross &
Shulman, 1980).

The BINORM model was adopted from Bayesian treatments of estimating true
scores from observed number correct scores (Jackson, 1972; Novick, Lewis, &
Jackson, 1973; and Lewis, Wang, & Novick, 1975; also, Hambleton, Swaminathan,
Algina, & Coulson, 1978, provide a convenient summary). In this treatment the
conditional distribution of observed score given true score is assumed to be
binomial and an angular transformation provides, approximately, a normal error
variable with stable variance $(4n + 2)^{-1}$ across the true score range, where n
is the number of items in a test. It is also assumed that the angular
transformation yields, approximately, a normal true score variable (or prior in their
Bayesian treatments). This transformation results in an expansion of the true
score scale at the extremes, which makes the assumption of normality (unbounded
tails) much less of a problem than under the proportion correct scale. In addition,
the transformation can account for the skewness that often occurs with
observed score distributions associated with a mean (proportion correct) that
is not close to .5.

The BINORM model is similar to the BETA in that both begin with the
binomial for the conditional distribution of observed score. The contrast in
the assumptions about the distribution of true score enables examining the
sensitivity of tolerance intervals to such assumptions.

The BINORM and CONORM tolerance intervals provide a comparison of a different
nature. As in the BINORM model, the distribution of transformed true
score is assumed normal for the CONORM. But under the CONORM model, the conditional
distribution of observed scores is assumed compound binomial rather than binomial.
The two-term approximation to the compound-binomial suggested by Lord (1965)
simplifies considerations in the model. Noting that the conditional variance
under the binomial is $n\tau(1 - \tau)$, the conditional variance under the two-term
approximation is $(n - 2k)\tau(1 - \tau)$, where k is a parameter to be defined.$^4$ Thus,
with k > 0, shorter intervals can be expected under this model, all other things
being equal.

$^4$ From here on, $\tau$ can be interpreted as a particular true score or the
random variable for true score, depending on the context.

The appeal of this approximation in our case is that it provides an alternative
conditional distribution for bounded observed scores and
that the overall error variance (across examinees) can be made to correspond
(with an appropriate choice of k) to an estimate of average error variance
obtained under weaker assumptions. Lord (1965) emphasizes the fact that k can be
chosen so that average error variance corresponds to that which would be
obtained by using a KR20 estimate of reliability.

The use of an angular transformation with Lord's two-term approximation
was previously suggested by Wilcox (1978). Because the conditional variance is
$(n - 2k)\tau(1 - \tau)$, a variance stabilizing transformation that is appropriate
for the binomial is applicable here also.

Calculation of Tolerance Intervals

As noted above, we presume that accurate estimates of population parameters
are available. In other words, we take the liberty of providing details about
calculating intervals given the parameters, while, at the same time, providing the
estimation equations used for the example tests that follow. Under the BETA and BINORM
models, estimation simply involves calculating a mean and variance of observed
scores. Estimation for the NORM and CONORM models additionally involves the
calculation of the variance of item difficulties. This holds for these two models
because interest is restricted here to a KR20 estimate of reliability for the
examples provided. More generally, other estimates of reliability could be used
for these two models.

Note that all the intervals that are calculated are for a proportion
correct true score scale. The observed number correct score is still referred
to as x for a particular score and X for the random variable. Thus, a true
score for an examinee is defined as the expected number correct score divided by
the number of items (n).






The expression for tolerance intervals under the NORM model has been given
in Equation 5. However, there are slight changes in the expression because of
the change in the true score scale. The conditional distribution of $\tau$ given
X = x is normal $[\tau(x),\ \rho^2_{\tau X}\sigma^2_e]$, where
$\tau(x) = \mu + \rho^2_{\tau X}(x/n - \mu)$, $\mu = E\,X/n$,
$\rho^2_{\tau X} = \mathrm{CORR}(\tau, X)^2$, and
$\sigma^2_e = (1 - \rho^2_{\tau X})\sigma^2_{X/n}$. Thus, the lower limit of a tolerance
interval on the proportion correct scale (the $100\cdot\alpha/2$% point of the conditional
true score distribution) is

$$\tau(x) - z_{\alpha/2}\,\rho_{\tau X}\,\sigma_e , \tag{6}$$

and for the upper limit (the $100\cdot[1 - \alpha/2]$% point) a plus replaces the minus.

For the example data used here, it seemed appropriate to estimate $\rho^2_{\tau X}$ by
KR20. Note that this means error variance is a function of n, as is the case
for the other three models. It also means that the tolerance intervals can be
expressed in terms of just three population parameters: E X/n, $\sigma^2_{X/n}$, and a
parameter for the variance of item difficulty. This last parameter will be
discussed later under the CONORM model.
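
As a minimal sketch of the calculation just described (not code from the original
study), the NORM limits can be computed from the regressed estimate of true score
and a conditional standard deviation equal to the square root of the reliability
times the standard error of measurement. The function name, argument names, and
numerical inputs are illustrative assumptions; `reliability` stands for the squared
true-observed correlation (for example, a KR20 value).

```python
from statistics import NormalDist


def norm_tolerance_limits(x, n, mean_prop, var_prop, reliability, coverage=0.95):
    """Central tolerance limits on the proportion-correct scale (NORM model).

    x           -- observed number-correct score
    n           -- number of items
    mean_prop   -- mean of observed proportion-correct scores (E X/n)
    var_prop    -- variance of observed proportion-correct scores
    reliability -- squared true/observed correlation, e.g. a KR20 value
    coverage    -- tolerance coefficient, 1 - alpha
    """
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    # Regressed estimate of true score for this observed score.
    center = mean_prop + reliability * (x / n - mean_prop)
    # Standard error of measurement on the proportion-correct scale.
    error_sd = ((1.0 - reliability) * var_prop) ** 0.5
    half_width = z * reliability ** 0.5 * error_sd
    return center - half_width, center + half_width


# Hypothetical test characteristics, chosen only for illustration.
lower, upper = norm_tolerance_limits(x=28, n=35, mean_prop=0.5,
                                     var_prop=0.04, reliability=0.85)
```

Because the half-width does not involve x, every observed score receives an interval
of the same width under this model, and nothing prevents a limit from falling below
0 or above 1.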

Under the BETA, the conditional distribution of X given $\tau$ is binomial,
and the distribution of $\tau$ is beta with population parameters a and b. This
makes the conditional distribution of $\tau$ given x beta(a + x, b + n - x). Using
the usual notation for the cumulative distribution function of a beta, the
lower limit is calculated by solving

$$I_L(a + x,\ b + n - x) = \alpha/2 \tag{7}$$

for L, where L is the point below which $100\cdot\alpha/2$% of a beta(a + x, b + n - x)
falls. Similarly, the upper limit can be determined by solving

$$I_U(a + x,\ b + n - x) = 1 - \alpha/2 \tag{8}$$

for U. The inverse beta function subroutine MDBETI of IMSL (1979) was used
for the calculations.

Note that this choice of L and U provides central tolerance intervals. Because
the BETA is asymmetric when its parameters are unequal, shorter $1 - \alpha$ intervals could
be found than central intervals. However, interpretations of L and U would then
vary from one x to the next; i.e., the tail areas beyond each limit would change.

Under the BETA model, convenient estimates of the true score distribution
parameters are:

$$\hat{a} = n(1/\mathrm{KR21} - 1)\,\hat{\mu} , \tag{9}$$

$$\hat{b} = n(1/\mathrm{KR21} - 1) - \hat{a} , \tag{10}$$

where $\hat{\mu}$ is the mean (proportion correct) observed score (see Lord & Novick, 1968,
pp. 516-517 and pp. 520-521, and note that their "b" differs from ours by n - 1).
Since KR21 is a function of n and the mean and variance of observed scores,
just these two statistics are necessary for calculating approximate intervals
under the BETA. (Recall that it is assumed that sample sizes are large enough
to provide accurate estimates of the observed score mean and variance.)
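
The BETA calculation can be sketched in a few steps: estimate a and b from KR21
and the mean, form the conditional beta(a + x, b + n - x), and invert its
cumulative distribution at $\alpha/2$ and $1 - \alpha/2$. The sketch below is an
assumption-laden illustration, not the IMSL routine used in the study: it swaps in
a standard continued-fraction evaluation of the incomplete beta function with
bisection for the inverse, and all function names and input values are hypothetical.

```python
import math


def _betacf(a, b, x, max_iter=300, eps=1e-14):
    """Continued fraction for the incomplete beta function (modified Lentz)."""
    tiny = 1e-300
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even step, then odd step, of the recurrence.
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def beta_cdf(x, a, b):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log1p(-x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def beta_quantile(prob, a, b):
    """Solve I_t(a, b) = prob for t by bisection on (0, 1)."""
    lo, hi = 0.0, 1.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if beta_cdf(mid, a, b) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def kr21(n, mean_prop, var_prop):
    """KR21 from the mean and variance of proportion-correct scores."""
    return n / (n - 1.0) * (1.0 - mean_prop * (1.0 - mean_prop) / (n * var_prop))


def beta_tolerance_limits(x, n, mean_prop, var_prop, coverage=0.95):
    """Central tolerance limits for true score given x under the BETA model."""
    total = n * (1.0 / kr21(n, mean_prop, var_prop) - 1.0)  # a + b
    a = total * mean_prop
    b = total - a
    alpha = 1.0 - coverage
    post_a, post_b = a + x, b + n - x
    return (beta_quantile(alpha / 2.0, post_a, post_b),
            beta_quantile(1.0 - alpha / 2.0, post_a, post_b))


# Hypothetical test characteristics, for illustration only.
lower, upper = beta_tolerance_limits(x=28, n=35, mean_prop=0.5, var_prop=0.04)
```

Bisection is slow but hard to get wrong; a library inverse-beta routine, where one
is available, would serve the same purpose.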

For the BINORM model, the angular transformation suggested by Freeman and
Tukey (1950) is used:

$$g = \tfrac{1}{2}\left[\sin^{-1}\sqrt{\frac{x}{n+1}} + \sin^{-1}\sqrt{\frac{x+1}{n+1}}\,\right] . \tag{11}$$

Under the binomial, this transformation is considered to provide the most
stability in variance among the angular transformations suggested for this
distribution (Mosteller & Tukey, 1968), and is the transformation used in most of the
Bayesian references given above. The conditional distribution of G (the random
variable for g) given $\tau$ is, approximately, normal with mean
$\gamma = \sin^{-1}\sqrt{\tau}$ and variance $(4n + 2)^{-1}$, and the $\gamma$ variable
is assumed normal. Thus, tolerance limits can
be calculated under the transformation by using the fact that the conditional
distribution of $\gamma$ given g is, approximately, normal
$[\gamma(g),\ \rho^2_{\gamma G}/(4n + 2)]$, where

$$\gamma(g) = E\,\gamma + \rho^2_{\gamma G}\,(g - E\,G)$$

and

$$\rho^2_{\gamma G} = 1 - [\sigma^2_G(4n + 2)]^{-1} .$$

Thus, a central $1 - \alpha$ tolerance interval on the $\gamma$ scale is

$$\gamma(g) \pm z_{\alpha/2}\,\rho_{\gamma G}\,(4n + 2)^{-1/2} . \tag{12}$$

The inverse transformation, $(\sin)^2$, is then applied to the limits to return to the
original true score scale.
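
A sketch of the BINORM steps may help: transform the observed score, regress it
toward the mean on the transformed scale, add and subtract a normal half-width
built from the error variance $1/(4n + 2)$, and map the limits back with $\sin^2$.
The function names and the list of scores are hypothetical, and clamping the
transformed limits to $[0, \pi/2]$ is an assumption of this sketch (it keeps the
back-transformation monotone), not a step stated in the text.

```python
import math
from statistics import NormalDist, fmean, pvariance


def freeman_tukey(x, n):
    """Averaged angular transformation of a number-correct score x on n items."""
    return 0.5 * (math.asin(math.sqrt(x / (n + 1)))
                  + math.asin(math.sqrt((x + 1) / (n + 1))))


def binorm_tolerance_limits(x, n, mean_g, var_g, coverage=0.95):
    """Central tolerance limits on the proportion-correct scale (BINORM model).

    mean_g, var_g -- mean and variance of transformed observed scores; mean_g is
                     also used for the mean of the transformed true score.
    """
    error_var = 1.0 / (4 * n + 2)
    if var_g <= error_var:
        raise ValueError("observed variance must exceed the error variance")
    rho_sq = 1.0 - error_var / var_g
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    center = mean_g + rho_sq * (freeman_tukey(x, n) - mean_g)
    half_width = z * math.sqrt(rho_sq * error_var)
    # Assumption of this sketch: clamp to [0, pi/2] so sin^2 stays monotone.
    low = max(center - half_width, 0.0)
    high = min(center + half_width, math.pi / 2)
    return math.sin(low) ** 2, math.sin(high) ** 2


# Hypothetical observed number-correct scores on a 35-item test.
scores = [9, 12, 15, 17, 18, 20, 21, 23, 26, 30]
transformed = [freeman_tukey(s, 35) for s in scores]
mean_g, var_g = fmean(transformed), pvariance(transformed)
lower, upper = binorm_tolerance_limits(28, 35, mean_g, var_g)
```

With real data the mean and variance would come from a large sample, in keeping
with the presumption of accurate parameter estimates.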

The needed mean and variance (E G, $\sigma^2_G$) can be estimated in a number of
ways. A simple approach is to apply the Freeman-Tukey transformation to the
observed scores and to calculate their mean and variance as estimates of
E G and $\sigma^2_G$. Since $E\,\gamma$ is approximately equal to E G, the mean of transformed
observed scores can be used here also.






For the typical situation in which the mean and variance of observed proportion
correct scores are already available, a more convenient approach to estimation
employs a Taylor series approximation (see, e.g., Johnson & Kotz, 1969, pp. 28-29)
for E G and $\sigma^2_G$. Under the Freeman-Tukey transformation,

$$E\,G \approx \tfrac{1}{2}\left[\sin^{-1}\sqrt{p} + \sin^{-1}\sqrt{q}\,\right]
  - \frac{\sigma^2_X}{16(n + 1)^2}
    \left[\frac{1 - 2p}{[p(1 - p)]^{3/2}} + \frac{1 - 2q}{[q(1 - q)]^{3/2}}\right] \tag{13}$$

and

$$\sigma^2_G \approx \frac{\sigma^2_X}{16(n + 1)^2}
    \left\{[p(1 - p)]^{-1/2} + [q(1 - q)]^{-1/2}\right\}^2 , \tag{14}$$

where $p = \mu n/(n + 1)$ and $q = (\mu n + 1)/(n + 1)$. Thus, accurate estimates of
$\mu = E\,X/n$ and $\sigma^2_{X/n} = \sigma^2_X/n^2$ are all that are needed for the
BINORM intervals.$^5$
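
The Taylor series step can be illustrated with a short sketch that expands the
averaged angular transformation to second order about the mean number-correct
score: the first derivative gives the variance approximation and the second
derivative gives the correction to the mean. The function name and inputs are
hypothetical; the check at the end uses a purely binomial score distribution
(variance $n\tau(1 - \tau)$), for which the approximate variance should be close
to $1/(4n + 2)$.

```python
import math


def ft_moments_taylor(n, mean_prop, var_prop):
    """Approximate mean and variance of Freeman-Tukey transformed scores.

    n         -- number of items
    mean_prop -- mean of observed proportion-correct scores (strictly in 0..1)
    var_prop  -- variance of observed proportion-correct scores
    """
    var_x = n * n * var_prop                # variance on the number-correct scale
    p = n * mean_prop / (n + 1)
    q = (n * mean_prop + 1) / (n + 1)
    pp, qq = p * (1.0 - p), q * (1.0 - q)
    scale = var_x / (16.0 * (n + 1) ** 2)
    base = 0.5 * (math.asin(math.sqrt(p)) + math.asin(math.sqrt(q)))
    # Second-order (curvature) correction to the mean.
    curvature = (1.0 - 2.0 * p) / pp ** 1.5 + (1.0 - 2.0 * q) / qq ** 1.5
    mean_g = base - scale * curvature
    # First-order (delta method) variance.
    var_g = scale * (pp ** -0.5 + qq ** -0.5) ** 2
    return mean_g, var_g


# Binomial-only check: n = 35, tau = .5, so var(X/n) = .25 / 35.
mean_g, var_g = ft_moments_taylor(35, 0.5, 0.25 / 35)
```

At a mean of .5 the two curvature terms cancel, so the approximate mean is simply
the transformed value of the mean score.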



Calculations for the CONORM model parallel those for the BINORM; i.e., tolerance
limits are calculated under the Freeman-Tukey transformation to normal error
and true score and then transformed back to the proportion correct scale. For
this model we refer to true score under the transformation as
$\eta = \sin^{-1}\sqrt{\tau}$ in contrast to $\gamma$ above. The distinction is made
because of a difference in variances.
Under the CONORM, the conditional variance of G given $\tau$ is $(n - 2k)/(4n^2 + 2n)$,
approximately. This implies that with k > 0, $\rho_{\eta G} > \rho_{\gamma G}$.



$^5$ Again, this estimate of E G can be used for $E\,\gamma$ since the two parameters
are approximately equal. However, the following Taylor series approximation can
also be used:

$$E\,\gamma \approx \sin^{-1}\sqrt{\mu} - \sigma^2_\tau\,(1 - 2\mu)/[8(\mu - \mu^2)^{3/2}] ,$$

where $\sigma^2_\tau$ can be estimated from $\mu$ and $\sigma^2_{X/n}$. For the examples,
the tolerance limits reported in two decimal places do not differ under the two approaches.




Tolerance intervals on the $\eta$ scale are expressed as

$$\eta(g) \pm z_{\alpha/2}\,\rho_{\eta G}\,[(n - 2k)/(4n^2 + 2n)]^{1/2} , \tag{15}$$

where

$$\eta(g) = E\,\eta + \rho^2_{\eta G}\,(g - E\,G) ,$$

and

$$\rho^2_{\eta G} = 1 - (n - 2k)/[\sigma^2_G(4n^2 + 2n)] .$$


The $(\sin)^2$ transformation provides limits on the $\tau$ scale. Comparison of Equations 11
and 15 indicates that with k > 0, CONORM intervals will tend to be shorter and less
regressed to the mean than BINORM intervals.

In addition to estimating E G and $\sigma^2_G$ for the CONORM, an estimate of k is
also needed. As suggested by Lord (1965, p. 266),

$$\hat{k} = \frac{S_i^2\,(n - 1)}
  {2\left[\hat{\mu}(1 - \hat{\mu}) - \hat{\sigma}^2_{X/n} - S_i^2/n\right]} , \tag{16}$$

where

$$S_i^2 = \sum_{j=1}^{n} p_j^2/n - \hat{\mu}^2 ,$$

$p_j$ being the item difficulty (proportion correct) of the j-th item, and $\hat{\mu}$ being the
mean proportion correct for the sample on which the difficulties are based. As noted
above, this estimate of k makes the average error variance on the observed score
scale the same as would be obtained through a KR20 (Tucker, 1949, expresses KR20
in terms of $\hat{\mu}$, $\hat{\sigma}^2_{X/n}$, n, and $S_i^2$).
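
The claim that this k reproduces the KR20 error variance can be checked numerically.
The sketch below computes k from a set of item difficulties and an observed-score
variance, then compares two quantities on the number-correct scale: the error
variance implied by KR20, and the average of $(n - 2k)\tau(1 - \tau)$ over examinees.
The closed form used for that average is derived here from the stated conditional
variance and is an assumption of the sketch, as are the function names and the
five made-up item difficulties.

```python
def lord_k(difficulties, var_prop):
    """Estimate k from item difficulties and the proportion-correct variance."""
    n = len(difficulties)
    mean_prop = sum(difficulties) / n
    # Variance of the item difficulties.
    s_sq = sum(p * p for p in difficulties) / n - mean_prop ** 2
    denom = 2.0 * (mean_prop * (1.0 - mean_prop) - var_prop - s_sq / n)
    return (n - 1) * s_sq / denom


def kr20_error_variance(difficulties, var_prop):
    """Number-correct error variance implied by a KR20 reliability."""
    n = len(difficulties)
    var_x = n * n * var_prop
    sum_pq = sum(p * (1.0 - p) for p in difficulties)
    kr20 = n / (n - 1.0) * (1.0 - sum_pq / var_x)
    return var_x * (1.0 - kr20)


def two_term_error_variance(difficulties, var_prop, k):
    """Average of (n - 2k) * tau * (1 - tau) across examinees.

    Derived by solving var(X) = n^2 var(tau) + (n - 2k) E[tau (1 - tau)]
    for var(tau); this derivation is part of the sketch, not of the source.
    """
    n = len(difficulties)
    mean_prop = sum(difficulties) / n
    a = n * n * (mean_prop * (1.0 - mean_prop) - var_prop)
    return (n - 2.0 * k) * a / (n * n - n + 2.0 * k)


# Hypothetical five-item test, for illustration only.
items = [0.3, 0.4, 0.5, 0.6, 0.7]
k_hat = lord_k(items, var_prop=0.06)
from_kr20 = kr20_error_variance(items, var_prop=0.06)
from_k = two_term_error_variance(items, var_prop=0.06, k=k_hat)
```

For these inputs the two error variances agree, which is the property attributed
to this choice of k.
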
Examples 

Tables 1 through 4 provide selected tolerance limits for four different but
realistic test characteristics. Test characteristics refer to the four parameters
sufficient for calculating tolerance limits for all four models; namely, n, E X/n,
$\sigma^2_{X/n}$, and $S_i^2$ (for KR20). The characteristics used are realistic because they
were taken, with one exception, from established standardized tests, and are
different because they allow contrasts between long and short tests and symmetric
and skewed distributions.

Insert Table 1 about here



Table 1, with n = 35 and E X/n = .5, provides tolerance limits for observed
number correct scores of 7, 14, 21, 28, and 35. Columns headed x/n and N.H. Dens.
provide proportion correct scores and the corresponding negative hypergeometric
densities that are associated with the BETA (Lord & Novick, 1968, pp. 515-520).
The three tolerance coefficients, 50%, 68%, and 95%, were chosen to provide
indications of differences in the conditional distributions at different percentage
points. Also, these three coefficients have had historical popularity (50% for
setting "probable error" intervals and for the interquartile range, 68% for one
standard error intervals). Note also that the 95% intervals in Table 1 are,
approximately, three times wider than the 50% intervals and twice as wide as the
68% intervals.
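
The width ratios quoted above are what normal quantiles alone would give, which
can be seen in a short sketch. This is an illustration under the assumption of a
normal conditional distribution with constant variance (the NORM case); for the
other models the ratios hold only approximately. The helper name is hypothetical.

```python
from statistics import NormalDist


def z_for_coverage(coverage):
    """Normal deviate that bounds a central interval with the given coverage."""
    return NormalDist().inv_cdf(0.5 + coverage / 2.0)


# Interval width is proportional to z, so width ratios equal z ratios.
ratio_95_to_50 = z_for_coverage(0.95) / z_for_coverage(0.50)   # roughly 2.9
ratio_95_to_68 = z_for_coverage(0.95) / z_for_coverage(0.68)   # roughly 2.0
```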




The similarity of the limits of the four models is striking. With the 
exception of the extreme observed score, 35 correct, and excluding the limits of 
the NORM model, the limits differ by no more than .01. 

Considering the NORM model, the largest differences with the other models
are at the extreme scores. Recall that under the NORM model, intervals for every
observed score are the same width. Because error variances of the other three
models decrease as true score moves away from .5 (the mean in this example),
narrower intervals are found at the extremes. In Table 1 this can be seen at
the score of 35, and to a lesser extent at scores of 28 and 7. Around the mean,
all four models provide limits that are quite close. Notice also that limits
under the NORM model can be below zero or above one.

Recall that error variance under the CONORM is smaller than under the BINORM
to a degree that is dependent on $S_i^2$ (or equivalently, the difference between
KR21 and KR20). The effect of this on the limits in Table 1 appears to be
slight but predictable. The intervals for the CONORM model are shorter by 10%,
approximately. Also, the differences are primarily reflected in the upper
limits when observed scores are below the mean, and the lower limits when observed
scores are above the mean; i.e., these limits, under the CONORM model, are .01
more distant from the mean (less regression) than under the BINORM.

The error variances for the BETA and BINORM models are the same,
but the shapes of the true score distributions are different. From Table 1, the
effect of this difference appears to be mainly on the extreme score of 35. The
lower limits under the BETA are closer to the mean than under the BINORM and
CONORM. A slight but opposite trend is found at scores of 7 and 28; i.e., the
intervals under the BINORM are slightly closer to the mean than under the BETA.




The differences described in detail for Table 1 may seem insignificant,
but they do reflect some general patterns of differences that are discussed later.

The example in Table 1 does not provide a sufficient comparison of the models,
because under all four models the true score distribution is symmetric (E X/n = .5).
In Table 2, skewed true score distributions are introduced with E X/n = .75.
Under the BETA model, the left skewness introduced by this mean is reflected in
the negative hypergeometric densities. Of course, under the NORM model there is
no skewness. Under the BINORM and CONORM models, skewness is allowed for through
the expansion, above the mean, of the transformed true score scale.



Insert Table 2 about here 



The effects of skewness on differences between the NORM and the other three
models are quite noticeable. At observed scores below the mean (7, 14, 21), the
upper limits of the NORM are farther from the mean than those of the other
models. Also, the intervals for these same scores are narrower for the NORM than
for the others, but for scores above the mean (28, 35), the intervals are wider for the
NORM. Comparing this with results from Table 1 (same n), the left-skewness appears
to affect the width of the intervals at scores below the mean, making them wider
than for a symmetric distribution. Also, scores above the mean have narrower
intervals than those of Table 1. This result can be intuitively understood by
considering the density of the true score distribution below and above the mean and
its effect on the conditional distribution on which the limits are based.






The clear pattern of differences among the BETA, BINORM, and CONORM models
that was found for the symmetric distributions associated with Table 1 is
not as apparent in Table 2. Still, as expected, the CONORM intervals are typically
shorter and sometimes farther from the mean. Also, differences among limits
for these three models are again small: just 12 out of 90 possible differences
are greater than .01, and 10 of these are .02.

Table 3 contains intervals that are to be compared with those of Table 1.
The test characteristics for Table 3, rather than being calculated from an existing
test, were derived from those in Table 1. Note that the mean and $S_i^2$ are the
same in both tables, but that n is 25 in Table 3. The observed score variance
in Table 3 was derived by keeping the KR20 estimate of true score variance the
same in both tables and increasing error variance by the multiple 35/25.



The increase in widths of the intervals due to the decrease in n can be
expressed algebraically for the NORM. So, under the NORM, the differences in
widths between Tables 1 and 3 follow a simple pattern. For the 50% intervals,
the differences are .014, for the 68%, they are .02, and for the 95%, they are
.04. Considering the 95% interval widths for the 35-item test, a 13% decrease
in width is obtained from a 40% increase in n. Also, there is less regression
toward the mean, which comes with the higher reliability of the 35-item test.
For the other three models, similar differences are found, but there is less
consistency in their pattern.



Insert Table 3 about here







Similar to Table 1, when the limits for the NORM and those for a perfect
score of 25 are excluded, differences among limits for the three other models are
mainly .00 or .01. There are, however, six differences equal to .02, indicating
that intervals tend to differ more with smaller n.

Table 4 provides intervals for a 100-item test (E X/n = .75). Note that
for this long test the reliability is below .9. The test characteristics reflect
those of a certification examination that has a small true score variance. The
intervals here are much smaller than in the other tables. For the example in
Table 1 (n = 35), the reliability is similar to the 100-item test, but it has 95%
intervals that are almost twice the length of those in Table 4. Clearly, the number
of items plays the primary role in the width of intervals for all the models
considered here.$^6$



Insert Table 4 about here 



With the exception of the perfect score of 100, all four models have very
similar limits. This is in contrast to the limits of Table 2 in which skewness
introduced by E X/n = .75 made the limits for the NORM quite distinct from
those of the other models. Actually, the coefficient of skewness under the
BETA is smaller for the example in Table 4 than for Table 2, but the effect of
skewness on the intervals is still noticeable. For example, under the BETA,
scores below the mean are associated with wider intervals than those above the
mean, while the NORM intervals are a constant width.

$^6$ For all four models the error variances depend primarily on n, but recall
that this need not be the case for the CONORM and NORM models, in which one has an
option of using an estimate of overall error variance different from that obtained
from a KR20.



Some General Comparisons 

To obtain a more general idea of differences among the models, mean-absolute-differences
were calculated using all the limits from the example tests in Tables
1 through 4 as well as those from three other examples. Table 5 contains these
means, which were calculated by contrasting limits for the six possible pairs of
the four models. As an example, consider the first test depicted in Table 5
(n = 25, E X/n = .5). Here we find the mean-absolute-difference between the
lower 95% limits of the BETA and the BINORM models is .006. This mean appears
consistent with Table 3, in which most differences in limits between these two
models are either .00 or .01.

Insert Table 5 about here 



Each mean-absolute-difference was calculated by contrasting limits for a
pair of models at each observed score, and by weighting each absolute difference
by the negative hypergeometric density associated with the BETA model. This
weighting was especially valuable for the longer tests. Consider the 100-item
test with E X/n = .75. From empirical data and according to the density function,
there are very few, if any, examinees who score below 20 on this test.
So, it seems clear that some function is necessary that avoids weighting differences
at scores below 20 in the same way as differences around the mean observed
score. Otherwise, mean-absolute-differences could fail to reflect the nature
of the differences that occur in practice.
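
The weighting scheme can be sketched as follows. The negative hypergeometric
density is written here in its beta-binomial form, and the weighted mean of
absolute differences is taken over all observed scores 0 through n. The function
names, the parameter values, and the two lists of limits are hypothetical
placeholders rather than values from the tables.

```python
import math


def neg_hypergeometric_pmf(x, n, a, b):
    """Beta-binomial probability of x correct on n items, beta parameters a, b."""
    log_p = (math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
             + math.lgamma(a + x) + math.lgamma(b + n - x) - math.lgamma(a + b + n)
             + math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b))
    return math.exp(log_p)


def weighted_mean_abs_diff(limits_a, limits_b, n, a, b):
    """Density-weighted mean absolute difference between two sets of limits.

    limits_a, limits_b -- limits for two models, indexed by observed score 0..n
    """
    weights = [neg_hypergeometric_pmf(x, n, a, b) for x in range(n + 1)]
    total = sum(w * abs(u - v) for w, u, v in zip(weights, limits_a, limits_b))
    return total / sum(weights)


# Placeholder limits that differ by a constant .01 at every observed score.
model_one = [0.50] * 36
model_two = [0.51] * 36
mad = weighted_mean_abs_diff(model_one, model_two, n=35, a=3.2, b=3.2)
```

Scores that almost no examinee obtains receive almost no weight, which is the
purpose of the weighting described above.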

An obvious trend in Table 5 is the decrease in mean-absolute-differences
with an increase in number of items. Of course, the intervals are also shorter
for the longer tests. However, from other calculations, it was found that the
percentage decrease in widths for longer tests is less than the percentage
decrease in mean-absolute-differences; i.e., the decrease in mean-absolute-differences
is not simply a result of a decrease in the widths of intervals.




Another result from Table 5 is that the largest means are found for the
contrast of the NORM with the other three models. This is consistent with the
selected limits of Tables 1 through 4. Of these differences, some of the largest
occur with contrasts of the upper limits for the examples with E X/n > .5.
Recall that the other three models have left-skewed true score distributions
when E X/n > .5. In effect, these results are an indirect indicator of
the fact that for observed scores below the mean the upper limits for the NORM
model are farther from the mean than those of the other three models. A more
direct indicator is the mean difference (with sign) of the upper limits for
observed scores below the mean. For the examples in which E X/n > .5, mean
differences of upper limits (NORM minus others) are all positive. For the two cases
in which E X/n = .5, the means are also positive but much smaller. Table 6
provides means for three examples.



Insert Table 6 about here


Returning to Table 5, the mean-absolute-differences between the BINORM and
CONORM models (different error variances) do not seem any larger than differences
between the BETA and BINORM models (differences in shape of the true score
distributions). Recall that the larger the value of $S_i^2$, the greater the difference
in error variances between the CONORM model and both the BINORM and BETA models.
This makes the CONORM intervals shorter and slightly less regressed to the mean.
Apparently, the values of $S_i^2$ are not large enough to cause important differences
in limits. Since the values of $S_i^2$ used here seem typical of standardized tests,
the small effect of this parameter on the limits can be considered general.




There is a convenient contrast of $S_i^2$ in the examples. Consider the two tests
with n = 35. One has an $S_i^2$ that is 50% larger than the other. Table 7
provides the mean widths of the intervals for these two examples (the negative
hypergeometric density is used for these means also). Notice that the differences
in mean widths between the BINORM and CONORM models for the test with $S_i^2 = .027$
are about 50% larger than the differences in mean widths for the test with $S_i^2 = .018$.
Still, the differences for the larger $S_i^2$ represent just 8% of the mean widths
of CONORM intervals, and differences for the smaller $S_i^2$ are only 4% of the mean
widths.



Insert Table 7 about here 



Table 7 also contains the mean widths of intervals for the example with
n = 100. These are provided as an indication of the decrease in width that
comes with an increase in n.

Plots of interval widths against observed scores were made for all the
examples. The plots are not included here, but their general nature can be
described. From Equation 6, the interval widths for the NORM model are constant
across the observed scores. For the other three models, the plots are similar
and depend on the mean. With E X/n = .5, the interval
widths increase from zero to the mean, and then decrease symmetrically from
the mean to 1.0. Around the mean, the widths for the three models are larger
than for the NORM and smaller otherwise. For the examples with E X/n = .75,
the intervals are approximately the same width up to the mean, and then they
decrease from the mean to 1.0. They decrease at a faster rate past the mean
than when E X/n = .5 and end up (at 1.0) with a smaller width.

Differences between the BETA and both the BINORM and CONORM models that
were noted in Table 1 were found more generally for all examples. Plots of
differences in limits against observed scores reveal that the largest differences
between the BETA and both the BINORM and CONORM models occur at very extreme
observed scores. For very low scores, the upper limits of the BETA are closer
to the mean than those of the BINORM and the CONORM. Similarly, for very high
scores, the lower limits of the BETA are closer to the mean.

The plots of the differences also revealed that an opposite but slighter
trend occurs for scores that are not extreme. That is, for such scores that
are below the mean, the upper limits under the BETA are slightly farther from
the mean than under the BINORM, and for such scores that are above the mean, the
lower limits under the BETA are farther from the mean than under the BINORM.
A similar change in trend occurs for the BINORM-CONORM contrast. These results
are most apparent for the symmetric true score distributions, and they were noted
in Table 1. For a skewed distribution, the trend is mitigated.

Summary and Conclusions for Tolerance Comparisons

The detailed differences in the tolerance intervals for our examples appear
to follow a pattern, and many of the differences reflect what was expected from
differences in the models. In this sense, the differences can be considered
generalizable to other realistic test characteristics.



The NORM model seems inappropriate for number correct
scoring. The bounded nature of a proportion correct score scale is an
apparent problem, and the assumption of independence of error and true score
(without a transformation) seems unwarranted (Lord, 1960). These issues are
reflected in differences between the NORM intervals and the other three models,
especially for shorter tests. But, recall that for longer tests the intervals
are quite similar for all four models.

The differences among the BETA, BINORM, and CONORM intervals seem unimportant.
Because the BETA model has been frequently discussed in the literature, appears
useful for a variety of applications, and does not involve approximations, one
might feel satisfied in calculating intervals under the BETA and ignoring the
other two models. However, the intervals that were calculated under the CONORM
were based on KR20, whereas a different estimate of reliability could be
incorporated. In other words, the small differences found between CONORM and BETA
intervals were based on typical values of $S_i^2$, and might have been larger if
reliabilities were estimated in a different manner.

From the results, there seems to be little reason to choose the BINORM
over the BETA model for calculating intervals. However, it could serve as a
substitute for the BETA, especially since the BINORM model has some mathematical
conveniences that might prove useful for the problem of estimating tolerance
intervals with small sample sizes.

The NORM model is quite distinct from the others, yet the tolerance intervals
for scores not at the extreme were similar to those of the other models. Since
average error variances were similar for all four models, the comparisons can be






considered to be among differences in the sizes of error variance at different
score levels and in the shapes of true score distributions. One can conclude
that the small differences in intervals, at other than extreme scores, indicate
that tolerance intervals are not very sensitive to differences in shapes of
true score distribution or in assumptions about the variability of error variance
along the true score scale. However, all four models do have regularly shaped
distributions, and differences among them in error variances, at other than
extreme scores, are not that large.






Comparison of Confidence Intervals

Three error models are used for calculating confidence intervals in the
examples below: normal error with equal variance for all examinees, binomial
error, and compound-binomial error.

The normal intervals are of the form

$$x/n \pm z_{\alpha/2}\,\sigma_e .$$


Note that $\sigma_e$ is calculated through a KR20 for the examples below, and the same
values of $\sigma_e$ were used for the NORM model tolerance intervals in Tables 1 through 4.

For the binomial error model, there are many published tables specifically
developed for confidence intervals on a binomial parameter. See Kendall and
Stuart (1979, p. 129) and Johnson and Kotz (1969, p. 59) for references.
However, none of the available tables provide confidence intervals for the 50% and
68% coefficients, so these calculations had to be performed for this paper. The
calculations are straightforward enough to be generally useful. Some details
about the calculation of these intervals are reported below to allow an analytic
comparison with tolerance intervals under the BETA model.

Most of the published tables on binomial confidence intervals were generated
by solving the following equations for the lower (L) and upper (U) limits of the
intervals:

$$\sum_{j=x}^{n} \binom{n}{j} L^{j} (1 - L)^{n-j} = \alpha/2 , \tag{18}$$

$$\sum_{j=0}^{x} \binom{n}{j} U^{j} (1 - U)^{n-j} = \alpha/2 . \tag{19}$$

Here, x is the observed number of successes (correct) in n trials (items).

Because of the discrete nature of the binomial, it is not possible, in
general, to construct intervals with a particular coefficient. Intervals
constructed from Equations 18 and 19 do have a coverage probability greater than
or equal to $1 - \alpha$; i.e.,

$$P(L \le \pi \le U) \ge 1 - \alpha , \tag{20}$$

where $\pi$ is the binomial parameter and L and U are now considered random variables
that are functions of X rather than x. Kendall and Stuart (1979, pp. 113-116 and
pp. 129-131) provide a discussion about the issue of inexact intervals for the
binomial. Wilks (1962, p. 368) provides a general theorem for setting
confidence intervals for discrete variables.

Intervals constructed from Equations 18 and 19 are referred to as central
intervals. This is because, in addition to the claim made in Equation 20,
$P(L \le \pi) \ge 1 - \alpha/2$ and $P(U \ge \pi) \ge 1 - \alpha/2$. These two additional
statements seem to be a desirable feature of confidence intervals, and most
tables are set up this way. However, by relinquishing these two claims, i.e.,
only requiring Equation 20 to hold, shorter noncentral intervals can be calculated.
Crow (1956), among others, provides such intervals.








Equations 18 and 19 can be expressed in terms of the cumulative distribution
function of a beta. Equation 18 can be written as

$$I_{L}(x,\; n - x + 1) = \alpha/2 . \tag{21}$$

Thus, one can enter a beta table to find the L that corresponds to $\alpha/2$, or, as
was done for the tables below, use a computing routine for finding the inverse
of a beta [IMSL (1979) subroutine MDBETI]. Similarly, for Equation 19, the
upper limits can be determined by solving

$$I_{(1-U)}(n - x,\; x + 1) = \alpha/2 \tag{22}$$

for U. (The F distribution can also be used; see Johnson & Kotz, 1969, p. 59.)
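
As an illustration of the central limits defined by Equations 18 and 19, the following is a minimal sketch that solves the two tail equations directly by bisection rather than through an inverse beta routine. It is not the IMSL procedure used for the tables; the function names (`upper_tail`, `lower_tail`, `solve_for_p`, `central_binomial_interval`) are hypothetical, and the conventions L = 0 when x = 0 and U = 1 when x = n are assumptions made here for the boundary cases.

```python
from math import comb


def upper_tail(x, n, p):
    """P(X >= x) for X ~ binomial(n, p); the left side of Equation 18."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(x, n + 1))


def lower_tail(x, n, p):
    """P(X <= x) for X ~ binomial(n, p); the left side of Equation 19."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(0, x + 1))


def solve_for_p(tail, target, increasing, iterations=60):
    """Bisection on [0, 1] for the p at which a monotone tail(p) equals target."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        # Move the lower bracket up when the root lies to the right of mid.
        if (tail(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def central_binomial_interval(x, n, coeff):
    """Central confidence limits (L, U) for a binomial parameter.

    Assumed boundary conventions: L = 0 when x = 0 and U = 1 when x = n.
    """
    alpha = 1 - coeff
    lower = 0.0 if x == 0 else solve_for_p(
        lambda p: upper_tail(x, n, p), alpha / 2, increasing=True)
    upper = 1.0 if x == n else solve_for_p(
        lambda p: lower_tail(x, n, p), alpha / 2, increasing=False)
    return lower, upper


if __name__ == "__main__":
    for coeff in (0.50, 0.68, 0.95):
        print(coeff, central_binomial_interval(21, 35, coeff))
```

For a perfect score (x = n), Equation 18 reduces to $L^n = \alpha/2$, which gives a simple closed-form check on the bisection.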

Recall Equation 7 for the lower limit of a tolerance interval under the BETA.
Note that if a = 0 and b = 1 in that equation, it would equal Equation 21,
making equal the lower limits of the binomial confidence interval and the BETA
tolerance interval. Equation 8 for the upper tolerance limits can be reexpressed
as $I_{(1-U)}(b + n - x,\; a + x) = \alpha/2$. Note that a = 0 and b = 1 do not make
this equal to Equation 22. Clearly, it is not possible to choose the a and b
parameters of the true score distribution such that the confidence and tolerance
limits are the same. This is not surprising given the different nature of the
intervals. Consider also that under the binomial we can only make inequality
statements because the coverage probability is a function of the discrete variable X.
Under the BETA model, we make exact coverage probability statements because the
variable $\pi$ given x is continuous.






For compound-binomial error, the two-term approximation which was discussed
under the CONORM model was also used for error variance here. Recall that
error variance under the approximation is $(n - 2k)/(4n^2 + 2n)$, where k was
chosen to make average error variance the same as that calculated from a KR20.
Also, recall that this made average error variance the same for the CONORM and
NORM models.

Intervals for the compound-binomial are only approximations. The Freeman-Tukey
transformation was used to yield approximate normality with constant variance.
Intervals were then calculated as [$g \pm z_{\alpha/2}\sqrt{(n - 2k)/(4n^2 + 2n)}$], a continuity
correction was added, and a transformation back to the proportion correct scale
was applied.
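
The steps just described can be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure used for Tables 8 and 9: the transformation is taken to be the averaged double-arcsine form of Freeman and Tukey (chosen because its binomial variance, 1/(4n + 2), matches the error variance formula above at k = 0), the back-transformation is approximated by sin², and the continuity correction is omitted because its form is not given here. All function names are hypothetical.

```python
from math import asin, pi, sin, sqrt
from statistics import NormalDist


def freeman_tukey(x, n):
    """Averaged double-arcsine transform of number-correct x on n items (assumed form)."""
    return 0.5 * (asin(sqrt(x / (n + 1))) + asin(sqrt((x + 1) / (n + 1))))


def two_term_error_variance(n, k):
    """Error variance on the transformed scale: (n - 2k) / (4n^2 + 2n)."""
    return (n - 2 * k) / (4 * n**2 + 2 * n)


def approximate_interval(x, n, k, coeff):
    """Approximate limits on the proportion-correct scale.

    Assumptions: no continuity correction, and sin^2 as an approximate
    inverse of the transformation.
    """
    z = NormalDist().inv_cdf(0.5 + coeff / 2)
    g = freeman_tukey(x, n)
    half_width = z * sqrt(two_term_error_variance(n, k))
    lower = max(g - half_width, 0.0)      # keep the angle inside [0, pi/2]
    upper = min(g + half_width, pi / 2)
    return sin(lower) ** 2, sin(upper) ** 2


if __name__ == "__main__":
    # n = 35 and k = 2.2 are the values reported in the note to Table 1.
    print(approximate_interval(21, 35, 2.2, 0.68))
```

Setting k = 0 recovers the binomial case, so the same sketch shows how a positive k shortens the interval relative to the binomial.
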
Two Examples 

Two tables are provided for comparison of confidence intervals under the
three error models. Table 8 contains intervals for a test with n = 35. The
error variance, $\sigma_e^2$, for the normal error model corresponds to error variance under
the NORM model for Table 1. Similarly, the same value of k was used in Tables 1 and
8. Table 9 has n = 100 and corresponds to parameters used in Table 4.



Insert Tables 8 and 9 about here 



From Tables 8 and 9, confidence intervals under the three models are similar 
except at extreme scores. At the extreme score of 35, for example, all three 
error models have quite different limits. Typically, the normal error intervals 






extend beyond 1.0 and are much wider than intervals for the other two models.
The compound-binomial intervals at 1.0 appear quite short relative to the
binomial. This is not true at other observed scores, and seems to reflect
problems with the properties of the transformation or the approximations at this
extreme score.

At other observed scores, the binomial intervals are, for the most part,
longer by .01 or the same as the compound-binomial intervals. This reflects
the difference in error variance under the two models. Recall that k depends on
$S_i^2$ and that k for Table 8 is associated with the largest $S_i^2$ in the examples.
Also, k for Table 9 is the largest among all the examples.

Error distribution shapes affect the intervals in Table 8. Under the nor- 
mal error distribution, the intervals are symmetric about the proportion correct 
score. In contrast, under the other two models, the distributions are 
skewed toward .5. For these two models, the lower limits are more distant from 
the observed score than the upper limits when the observed score is above the mean. 
The reverse holds for scores below the mean. This is not as noticeable for n = 100 
in Table 9. 

Comments on the Binomial Error Model

Under special circumstances, the binomial error model can be said to hold 
by definition (Lord & Novick, 1968, chap. 11, & chap. 23, p. 524; Lord, 1957). 
If test forms are constructed by random sampling of items and the proportion cor- 
rect true score of interest is defined by the domain from which items are 
sampled (rather than for a particular sample of items), the binomial error model 
holds for any particular examinee as long as item responses are independent from 






one item to the next for that examinee (independence of responses is violated
by context and other similar effects). Gross and Shulman (1980) provide a
succinct justification of the binomial under such circumstances, and contradict
some statements made by van der Linden (1979) in his claim of deterministic
assumptions underlying the binomial.

The binomial error model is often criticized because items are not of the same
difficulty. It is true that the binomial distribution cannot be used for the
joint distribution of error of examinees who are administered the same set of
items. Errors are correlated across examinees. But when we isolate interest to
a particular examinee under the circumstances above (random sampling of items,
etc.), the distribution of observable scores for that examinee is binomial, and
it follows that confidence intervals based on the binomial are appropriate. Of
course, this does not consider the nature of errors made in providing such
confidence intervals for the set of examinees administered the same test form.

In any case, tests are not typically constructed by random sampling. For
example, items are frequently sampled from fixed categories (Jarjoura & Brennan,
1982, provide a model for such circumstances). Also, test form difficulty and
other adjustments are typical of standardized testing. It is usually judged that
these factors make average error smaller than under the binomial, and binomial
intervals are often viewed as conservative. Still, violation of other assumptions,
like independence of item responses for an examinee, can make error larger than
under the binomial. Binomial intervals can be considered a useful approximation
as long as average error variance, estimated without resorting to binomial
assumptions, agrees with that estimated under the binomial ($[1 - \mathrm{KR21}]\,\sigma^2_{X/n}$), and as long
as there is no evidence that error variances at different points along the score
scale are larger than under the binomial.
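
The comparison of average error variances suggested above can be sketched as follows. This is only an illustration: the function names are hypothetical, KR21 is computed from the standard formula expressed on the proportion-correct scale, and the inputs are the rounded summary values reported in the note to Table 1 (n = 35, mean .5, observed variance .0423, KR20 = .87).

```python
def kr21(n, mean_prop, var_prop):
    """KR21 from test length, mean proportion correct, and variance of X/n."""
    return (n / (n - 1)) * (1 - mean_prop * (1 - mean_prop) / (n * var_prop))


def binomial_average_error_variance(n, mean_prop, var_prop):
    """Average error variance under the binomial: (1 - KR21) * var(X/n)."""
    return (1 - kr21(n, mean_prop, var_prop)) * var_prop


def kr20_average_error_variance(kr20, var_prop):
    """Average error variance implied by a KR20 estimate: (1 - KR20) * var(X/n)."""
    return (1 - kr20) * var_prop


if __name__ == "__main__":
    n, mean_prop, var_prop, kr20 = 35, 0.5, 0.0423, 0.87
    print("KR21:", round(kr21(n, mean_prop, var_prop), 2))
    print("binomial error variance:",
          binomial_average_error_variance(n, mean_prop, var_prop))
    print("KR20 error variance:", kr20_average_error_variance(kr20, var_prop))
```

With these inputs the binomial value is somewhat larger than the KR20-based value, which is the sense in which binomial intervals are described as conservative.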






Comparison of Confidence and Tolerance Intervals

A comparison of Tables 1 and 8 provides an idea of differences between
confidence and tolerance intervals under the same test characteristics. For the
binomial, the fact that n = 35 is enough to allow comparisons across the tables.
Recall that average error variance from the KR20 of Table 1 was used in
determining error variance for the normal and compound-binomial intervals.

Contrasts between the BETA tolerance intervals and binomial confidence
intervals reveal, as expected, that tolerance intervals are typically narrower
and shifted from the observed score toward the mean. Differences in limits are
most apparent at extreme scores. Note that the contrast in interval widths
reverses at the extreme score of 1.0. Similar differences are found for
contrasts between the normal and NORM intervals and between the compound-binomial
and CONORM intervals. The BETA intervals of Table 2 can also be compared
directly with the binomial intervals of Table 8. Here, we find some large
differences at the low scores that are distant from the mean.

Direct comparisons can also be made between Tables 4 and 9. Recall that
with n = 100, tolerance intervals for all four true score models are quite
similar. In contrast, differences between confidence and tolerance intervals
are large at scores that are distant from the mean. Consider, for example, the
observed score of 20. There, 50% confidence and tolerance intervals do not
even overlap, and for 68% intervals, the upper limits of the confidence intervals
are the same or close to the lower limits of the tolerance intervals. The major
reason for such a difference is that the observed score (20) is approximately 3.7






standard deviations below the mean (75). This difference might be considered
unimportant because examinees do not score that low (empirically no one has
scored this low on current forms of this example test). But the contrast does
dramatize points made earlier. When we are conditionally interested in examinee
a, then, from the perspective taken here, we are isolating interest in that
examinee's distribution of observed scores, not in the distribution of scores of
other examinees. This is not to say that information about other examinees
cannot be used in interpreting a confidence interval. The point is that if we
want a confidence interval for a particular examinee, then that interval is not
designed to take the performance of other examinees into consideration. In
contrast, when we condition on observed score, we are formally interested in
observations from the population of examinees; i.e., in the associated
distribution of true scores. Information that an observed score is very unlikely
is obviously important and affects the nature of the tolerance interval.






Discussion 



Such strong assumptions as used for setting tolerance or confidence intervals
need to be checked. Some methods for checking are discussed below. Also,
Bayesian credibility intervals and confidence and tolerance intervals are
contrasted.

Checking Assumptions

All four true score models are specific about what to expect for observed
score distributions. Thus, the usual chi-square test of fit could be calculated and
differences between observed and expected frequencies examined. None of these
models is likely to closely fit observations. However, consider the possibility
that the BETA fits but the BINORM does not. Under such circumstances, one would
prefer the BETA tolerance intervals, but, from the results above, they would not
differ substantially from those of the BINORM.
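
For the BETA model, the expected observed score distribution is the negative hypergeometric (beta-binomial), which is the "N.H. Dens." column of Tables 1 through 4. A minimal sketch of the expected frequencies and the chi-square statistic is given below; the function names are hypothetical, and the parameter values in the usage lines (a = b = 2.953, n = 35) are those reported in the note to Table 1. Pooling of sparse cells and the choice of degrees of freedom are left out of this sketch.

```python
from math import comb, exp, lgamma, log


def log_beta(a, b):
    """Natural log of the beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def negative_hypergeometric_pmf(x, n, a, b):
    """P(X = x) when X is binomial(n, pi) and pi ~ beta(a, b)."""
    return exp(log(comb(n, x)) + log_beta(a + x, b + n - x) - log_beta(a, b))


def expected_frequencies(n, a, b, n_examinees):
    """Expected number of examinees at each number-correct score 0..n."""
    return [n_examinees * negative_hypergeometric_pmf(x, n, a, b)
            for x in range(n + 1)]


def chi_square(observed, expected):
    """Pearson chi-square statistic for observed versus expected frequencies."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


if __name__ == "__main__":
    for x in (7, 14, 21, 28, 35):
        print(x, round(negative_hypergeometric_pmf(x, 35, 2.953, 2.953), 3))
```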

If one assumes that an approximate compound-binomial error model is
appropriate, then procedures developed in Lord (1969) and implemented in a computer
program by Wingersky, Lees, Lennon, and Lord (1969), can be used to estimate a 
"smooth" true score distribution without specifying its form. This could be 
compared to a beta or the other true score distributions assumed in the models 
above in order to determine if there are large discrepancies. For example, the 
estimated distribution might be noticeably bi-modal or might be truncated at 
some point above zero. Clearly, this could cause problems in tolerance intervals. 
Lord and Stocking (1976) derive a procedure for setting simultaneous confidence 
intervals around the conditional means for true scores at every observed score. 
They assume the binomial error model but do not specify the true score distri- 
bution. These intervals could be compared with the conditional means that are 
specified by each of the four true score models. Also, Wilcox (1981) reviews 
procedures for checking the beta-binomial assumptions. 

For the BETA, BINORM, and CONORM models, the true score distribution is
bounded by zero and one. The possibility of guessing correctly in multiple choice
tests is often considered to imply that true scores do not extend down to zero.⁷
Also, evidence of this effect has been found by Lord (1965) in using a four
parameter beta distribution for true scores (two of the parameters are
end-points). For a true score distribution that ends, say, at .15, tolerance
intervals of number correct scores near zero would obviously be affected. This
is rather unimportant if few examinees score near or below a guessing level
(as is the case in most of the example tests above). Otherwise, a nonzero
end-point should be considered in setting tolerance or confidence intervals.

Perhaps the most important checking is with regard to measurement error
variance. Both confidence and tolerance interval widths are, for the most
part, determined by error variance. And, under the above models, assumptions
about error variance are quite strong. These assumptions could be checked,
if deemed appropriate, by obtaining realized values of the error variable in
a parallel forms study. A simple check on the binomial or the approximation
to the compound-binomial error variances would involve transforming the
observed scores (Freeman-Tukey), estimating error variance for appropriate ranges
of observed scores, and comparing these with the constant values specified by
the two models. If the estimated error variances are fairly constant but
different from that specified under either model, this constant could be used
for estimating k differently from that given in Equation 16.

Bayesian Credibility Intervals 

With a Bayesian approach, we can isolate interest in a particular
examinee's true score and still interpret an interval set up for that true
score as covering a proportion of a distribution (posterior) of that true
score for a given observed score or scores. This is because we start with a
distribution (prior) for the true score. This is in contrast with a confidence
interval that does not consider a distribution for the true score. In a sense,
a Bayesian approach appears to provide a more informed statement or inference
because it uses information besides an examinee's observed score in determining
an interval for that examinee. As argued above, a confidence interval seems
useful in the situation in which a career counselor or classroom teacher is
interpreting a particular examinee's score. How a confidence interval ends up being
interpreted will likely depend on all the other information a counselor or
teacher has about that examinee, and perhaps information about the performance
of other examinees. In this sense, a confidence interval can be considered a
less formal method of inference as compared to a credibility interval.

⁷ Note that true score is defined here as the expected proportion correct,
not the expected proportion an examinee knows without guessing.

Although a conceptual distinction exists between tolerance and credibility
intervals, they can be made to coincide numerically. Consider that tolerance
intervals under the BETA model are the same as central credibility intervals
in the case in which every examinee is given the same prior (beta[a, b], where
a and b are population parameters for the true score distribution) and the
conditional distribution of observed scores is assumed binomial. It is not
clear that they could be made the same when estimation issues are considered
for tolerance intervals.
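
The numerical coincidence described above can be illustrated with a short sketch. With a beta(a, b) prior and a binomial likelihood, the posterior for the true score is beta(a + x, b + n - x), and the central interval cuts off $\alpha/2$ in each tail. The code below is an assumption-laden illustration rather than the routine used for the tables: the incomplete beta ratio is evaluated by Simpson's rule (adequate only when both posterior parameters exceed 1), quantiles are found by bisection, and all function names are hypothetical. The usage lines take a = b = 2.953 and n = 35 from the note to Table 1.

```python
from math import exp, lgamma, log


def beta_pdf(t, a, b):
    """Density of beta(a, b); returning 0 at the end-points assumes a > 1 and b > 1."""
    if t <= 0.0 or t >= 1.0:
        return 0.0
    log_norm = lgamma(a + b) - lgamma(a) - lgamma(b)
    return exp(log_norm + (a - 1) * log(t) + (b - 1) * log(1 - t))


def beta_cdf(q, a, b, steps=400):
    """Incomplete beta ratio I_q(a, b) by composite Simpson's rule (steps must be even)."""
    h = q / steps
    total = beta_pdf(0.0, a, b) + beta_pdf(q, a, b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(i * h, a, b)
    return total * h / 3


def beta_quantile(p, a, b, iterations=50):
    """Bisection for the q at which I_q(a, b) = p."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if beta_cdf(mid, a, b) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def central_posterior_interval(x, n, a, b, coeff):
    """Central interval of the beta(a + x, b + n - x) posterior for the true score."""
    alpha = 1 - coeff
    post_a, post_b = a + x, b + n - x
    return (beta_quantile(alpha / 2, post_a, post_b),
            beta_quantile(1 - alpha / 2, post_a, post_b))


if __name__ == "__main__":
    for coeff in (0.50, 0.68, 0.95):
        low, high = central_posterior_interval(21, 35, 2.953, 2.953, coeff)
        print(coeff, round(low, 2), round(high, 2))
```

For x = 21 the limits produced this way can be compared with the BETA columns of Table 1 (decimal points omitted there).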

Conclusions 


In consideration of issues regarding intervals for true scores, confidence
intervals seem useful when score interpretation is intimately concerned with a
particular examinee. In contrast, a tolerance interval is quite informative
for interpreting a particular observed score with respect to a population of
examinees. Also, knowledge that examinees who obtain a particular observed
score likely have true scores within a 95% tolerance interval is a useful
adjunct to a confidence interval for a particular examinee.

The claim that a confidence interval procedure covers, on average, the
true scores of a population of examinees with some chosen probability depends
on weak assumptions. It is not a very informative claim with respect to a
particular examinee. If such a claim is the basis for interpreting confidence
intervals, their usefulness for a particular examinee is diminished. Further,
when a population of examinees is considered in the interpretation of an
observed score, tolerance intervals are to be preferred.

Tolerance intervals simultaneously provide information about the
discrimination afforded by a measurement procedure for some population of examinees
and information about the precision of measurement. Consider the possibility of
narrow tolerance intervals relative to the proportion correct scale (high precision)
combined with few, if any, nonoverlapping tolerance intervals in the probable
range of observed scores (low discrimination). This possibility can be
translated simply to low reliability and small error variance (relative to the
proportion correct scale), but it does much to clarify the meaning of such a statement.
Confidence intervals are lacking in this regard.

Because tolerance intervals require the specification of the true score
distribution conditional on observed score, it was necessary to address the issue of
sensitivity of the intervals to differing strong assumptions about the joint
distribution of observed and true scores. For realistic standardized test
characteristics, tolerance intervals are, for the most part, insensitive to
differences in the shapes of true score distributions and to small differences
in error variances and reliabilities. In contrast, it is clear that confidence
and tolerance intervals are quite distinct, especially for scores not close
to the mean.




References 

Crow, E. L. Confidence intervals for a proportion. Biometrika, 1956, 43, 423.

Dixon, W. J., & Massey, F. J., Jr. Introduction to statistical analysis. New
     York: McGraw Hill, 1969.

Freeman, M. F., & Tukey, J. W. Transformations related to the angular and the
     square root. The Annals of Mathematical Statistics, 1950, 21, 607-611.

Graybill, F. A. Theory and application of the linear model. North Scituate,
     Mass.: Duxbury Press, 1976.

Gross, A. L., & Shulman, V. The applicability of the beta binomial model for
     criterion-referenced testing. Journal of Educational Measurement, 1980,
     17, 195-200.

Hambleton, R. K., Swaminathan, H., Algina, J., & Coulson, D. B. Criterion-
     referenced testing and measurement: A review of technical issues and
     developments. Review of Educational Research, 1978, 48, 1-47.

Huynh, H. Statistical considerations of mastery scores. Psychometrika, 1976,
     41, 65-78.

International Mathematical and Statistical Libraries. IMSL Libraries (7th ed.).
     Houston: Author, 1979.

Jackson, P. H. Some simple approximations in the estimation of many parameters.
     British Journal of Mathematical and Statistical Psychology, 1972, 25,
     213-228.

Jarjoura, D., & Brennan, R. L. A variance components model for measurement
     procedures associated with a table of specifications. Applied Psychological
     Measurement, 1982, 6, 161-171.

Johnson, N. L., & Kotz, S. Distributions in statistics: Discrete distributions.
     Boston: Houghton Mifflin, 1969.

Keats, J. A., & Lord, F. M. A theoretical distribution for mental test scores.
     Psychometrika, 1962, 27, 59-72.

Kendall, M., & Stuart, A. The advanced theory of statistics (4th ed., Vol. 2).
     New York: MacMillan, 1979.

Lewis, C., Wang, M., & Novick, M. R. Marginal distributions for the estimation
     of proportions in m groups. Psychometrika, 1975, 40, 63-75.

Lieberman, G. J., & Miller, R. G. Simultaneous tolerance intervals in regression.
     Biometrika, 1963, 50, 155-168.

Lord, F. M. Do tests of the same length have the same standard error of
     measurement? Educational and Psychological Measurement, 1957, 17, 510-521.

Lord, F. M. An empirical study of the normality and independence of errors of
     measurement in test scores. Psychometrika, 1960, 25, 91-104.

Lord, F. M. A strong true score theory, with applications. Psychometrika, 1965,
     30, 239-270.

Lord, F. M. Estimating true-score distributions in psychological testing (an
     empirical Bayes estimation problem). Psychometrika, 1969, 34, 259-299.

Lord, F. M., & Novick, M. R. Statistical theories of mental test scores. Reading,
     Mass.: Addison-Wesley, 1968.

Lord, F. M., & Stocking, M. An interval estimate for making statistical inferences
     about true scores. Psychometrika, 1976, 41, 79-87.

Mosteller, F., & Tukey, J. W. Data analysis, including statistics. In G. Lindzey
     & E. Aronson (Eds.), The handbook of social psychology. Reading, Mass.:
     Addison-Wesley, 1968.

Novick, M. R., & Jackson, P. H. Statistical methods for educational and
     psychological research. New York: McGraw Hill, 1973.

Novick, M. R., Lewis, C., & Jackson, P. H. The estimation of proportions in m
     groups. Psychometrika, 1973, 38, 19-45.

Proschan, F. Confidence and tolerance intervals for the normal distribution.
     Journal of the American Statistical Association, 1953, 48, 550-564.

Rao, C. R. Linear statistical inference and its applications. New York:
     Wiley, 1973.

Stanley, J. C. Reliability. In R. L. Thorndike (Ed.), Educational measurement
     (2nd ed.). Washington: American Council on Education, 1971, 356-442.

Tucker, L. R. A note on the estimation of test reliability by the Kuder-Richardson
     formula (20). Psychometrika, 1949, 14, 117-120.

van der Linden, W. J. Binomial test models and item difficulty. Applied
     Psychological Measurement, 1979, 3, 401-411.

Wald, A., & Wolfowitz, J. Tolerance limits for a normal distribution. Annals
     of Mathematical Statistics, 1946, 17, 208-215.

Wallis, W. A. Tolerance intervals for linear regression. In J. Neyman (Ed.),
     Second Berkeley Symposium on Mathematical Statistics and Probability.
     Berkeley: University of California Press, 1951, 43-51.

Wilcox, R. R. Estimating true score in the compound binomial error model.
     Psychometrika, 1978, 43, 245-258.

Wilcox, R. R. A review of the beta-binomial model and its extensions. Journal
     of Educational Statistics, 1981, 6, 3-32.

Wilks, S. S. Mathematical statistics. New York: Wiley, 1962.

Wingersky, M. S., Lees, D. M., Lennon, V., & Lord, F. M. A computer program for
     estimating true-score distributions and graduating observed score
     distributions (ETS Research Bulletin 69-4). Princeton, N.J.: Educational Testing
     Service, 1969.




TABLE 1

Tolerance Intervals for n = 35, E X/n = .5,
$\sigma^2_{X/n}$ = .0423, and $S_i^2$ = .027

Obs.           N.H.              BETA       BINORM      CONORM       NORM
Score   x/n    Dens.   Coeff.   L     U     L     U     L     U     L     U

   7    .2     .024     50%     20    29    21    30    20    29    19    28
                        68%     18    31    19    32    19    31    17    31
                        95%     13    38    13    39    14    38    10    37

  14    .4     .046     50%     36    47    37    47    37    47    37    46
                        68%     34    49    34    50    35    49    34    48
                        95%     27    57    28    57    28    56    28    55

  21    .6     .046     50%     53    64    53    63    54    63    54    63
                        68%     51    66    51    66    51    65    52    66
                        95%     43    73    43    72    44    72    45    72

  28    .8     .024     50%     71    80    70    79    71    80    72    81
                        68%     69    82    68    81    69    81    69    83
                        95%     62    87    61    87    62    87    63    90

  35   1.0     .001     50%     91    96    94    98    95    98    89    98
                        68%     90    97    93    99    94    99    87   101
                        95%     83    99    88   100   [?]   [?]   [?]   107

Note. KR20 = .87, KR21 = .86, k = 2.2, beta a = 2.953, and beta b = 2.953.
Decimal points on limits are omitted.



TABLE 2

Tolerance Intervals for n = 35, E X/n = .75,
$\sigma^2_{X/n}$ = .0227, and $S_i^2$ = .018

Obs.           N.H.              BETA       BINORM      CONORM       NORM
Score   x/n    Dens.   Coeff.   L     U     L     U     L     U     L     U

   7    .2     .001     50%     27    36    28    33    27    36    26    34
                        68%     25    39    26    40    25    38    25    36
                        95%     19    46    20    47    19    45    19    42

  14    .4     .008     50%     42    53    44    54    43    53    42    51
                        68%     40    55    42    56    41    55    41    53
                        95%     33    62    35    63    34    62    35    58

  21    .6     .038     50%     58    68    59    69    59    68    59    67
                        68%     56    70    57    71    56    70    57    69
                        95%     49    77    50    77    50    76    51    74

  28    .8     .075     50%     75    83    74    83    75    83    75    83
                        68%     73    85    72    84    73    84    73    85
                        95%     66    90    66    89    66    89    68    91

  35   1.0     .018     50%     93    97    95    98    95    99    91   100
                        68%     92    98    94    99    94    99    89   101
                        95%     87    99    90   100    91   100    84   107

Note. KR20 = .81, KR21 = .79, k = 1.9, beta a = 7.109, and beta b = 2.370.
Decimal points on limits are omitted.



TABLE 3

Tolerance Intervals for n = 25, E X/n = .5,
$\sigma^2_{X/n}$ = .0444, and $S_i^2$ = .027

Obs.           N.H.              BETA       BINORM      CONORM       NORM
Score   x/n    Dens.   Coeff.   L     U     L     U     L     U     L     U

   5    .2     .034     50%     20    31    22    32    21    31    20    30
                        68%     18    34    19    35    19    34    17    33
                        95%     12    42    13    43    13    41    10    40

  10    .4     .063     50%     36    48    37    48    37    48    36    47
                        68%     33    51    34    51    34    51    34    50
                        95%     25    59    26    60    27    58    26    57

  15    .6     .063     50%     52    64    52    63    52    63    53    64
                        68%     49    67    49    66    50    66    51    66
                        95%     41    75    41    74    42    73    43    74

  20    .8     .034     50%     69    80    68    78    69    79    70    80
                        68%     66    82    65    81    66    81    67    83
                        95%     58    88    58    88    57    87    60    90

  25   1.0     .003     50%     87    94    91    97    92    97    86    97
                        68%     85    95    89    98    91    98    84   100
                        95%     78    98    84   100    86   100    76   107

Note. KR20 = .83, KR21 = .81, k = 1.6, beta a = 2.985, and beta b = 2.985.
Decimal points on limits are omitted.



TABLE 4

Tolerance Intervals for n = 100, E X/n = .75,
$\sigma^2_{X/n}$ = .0119, and $S_i^2$ = .020

Obs.           N.H.              BETA       BINORM      CONORM       NORM
Score   x/n    Dens.   Coeff.   L     U     L     U     L     U     L     U

  20    .2     .000     50%     25    31    25    31    25    30    25    30
                        68%     24    32    24    32    24    31    24    31
                        95%     21    37    21    37    20    35    20    35

  40    .4     .001     50%     42    48    43    49    42    48    43    47
                        68%     41    50    41    50    41    50    41    48
                        95%     36    54    37    55    37    54    37    52

  60    .6     .013     50%     59    65    60    66    59    65    60    65
                        68%     58    67    58    67    58    67    58    66
                        95%     53    71    54    71    54    70    55    69

  80    .8     .036     50%     77    82    77    82    77    82    77    82
                        68%     76    83    75    83    76    83    76    83
                        95%     72    86    71    86    72    86    72    87

 100   1.0     .000     50%     95    98    98    99    98    99    94    99
                        68%     95    98    97    99    98   100    92   100
                        95%    [?]    99    96   100    96   100    90   104

Note. KR20 = .87, KR21 = .85, k = 6.2, beta a = 13.137, and beta b = 4.379.
Decimal points on limits are omitted.



TABLE 5 

Mean-Absolute-Differences of Tolerance Limits 
for Seven Test Characteristics 

                             BETA-     BETA-     BINORM-   BETA-     BINORM-   CONORM- 
Test                         BINORM    CONORM    CONORM    NORM      NORM      NORM 
Characteristics    Coeff.    L    U    L    U    L    U    L    U    L    U    L    U 

n=25  EX/n=.50      50%      ?    ?    ?    ?    ?    ?    ?    ?    ?    ?    8    8 
σ²(X/n)=.044        68%      8    8    6    6    6    6    9    9   14   14    9    9 
S²(i)=.02?          95%      6    6   11   11    8    8   18   18   21   21    ?    ? 

n=25  EX/n=.75      50%      8    8    5    7    4    4    ?    ?    7    ?    5   11 
σ²(X/n)=.024        68%      8    7    4    7    5    4    7   13    8   18    7    ? 
S²(i)=.018          95%      7    4    9    5    6    5   18   27   17   30    ?    ? 

n=35  EX/n=.50      50%      7    7    5    ?    4    4    ?    ?    ?    ?    6    6 
σ²(X/n)=.042        68%      6    6    5    5    5    5    7    7   11   11    ?    ? 
S²(i)=.027          95%      5    5    9    9    8    8   16    6   18   18   19   19 

n=35  EX/n=.75      50%      6    6    4    5    3    3    4    9    6   12    4    9 
σ²(X/n)=.023        68%      6    5    4    5    4    3    7   12    7   15    6   13 
S²(i)=.018          95%      5    3    8    5    6    5   17   24   15   26   13   24 

n=50  EX/n=.60      50%      4    4    3    3    2    2    3    3    3    5    2    4 
σ²(X/n)=.029        68%      4    4    3    4    3    3    4    4    4    7    3    6 
S²(i)=.020          95%      3    3    7    5    5    5    7   10    7   13    5   12 

n=75  EX/n=.60      50%      3    1    2    2    2    2    2    4    3    6    2    4 
σ²(X/n)=.023        68%      3    ?    2    3    3    3    4    5    4    ?    3    5 
S²(i)=.022          95%      ?    2    6    5    5    5    8   16    8   12    5    9 

n=100 EX/n=.75      50%      2    2    1    2    2    2    3    3    2    4    2    2 
σ²(X/n)=.012        68%      2    2    2    3    2    2    4    4    3    5    2    2 
S²(i)=.020          95%      9    ?    6    ?    5    4    8    9    7   10    5    4 

Means are in thousandths; i.e., 5 = .005. 

Derived from example test with n=35, EX/n=.75. 






TABLE 6 

Mean Differences of Upper Limits 
for Observed Scores Below the Mean 

                        EX/n = .5                     EX/n = .75 
                 BETA-   BINORM-  CONORM-      BETA-   BINORM-  CONORM- 
                 NORM    NORM     NORM         NORM    NORM     NORM 

n=35    50%      .005    .010     .004         .017    .019     .014 
        68%      .007    .012     .005         .020    .023     .018 
        95%      .017    .019     .008         .026    .029     .023 

n=25    50%      .007    .013     .006         .022    .024     .018 
        68%      .009    .016     .007         .025    .027     .021 
        95%      .020    .022     .009         .029    .033     .026 






TABLE 7 

Mean Widths of Intervals 

Test Characteristics    Coeff.    BETA    BINORM   CONORM   NORM 

n=35  EX/n=.5            50%      .098    .097     .091     .092 
σ²(X/n)=.042             68%      .144    .143     .135     .137 
S²(i)=.027               95%      .281    .277     .262     .267 
KR20=.87 

n=35  EX/n=.75           50%      .083    .082     .078     .080 
σ²(X/n)=.023             68%      .121    .121     .115     .118 
S²(i)=.018               95%      .237    .234     .224     .232 
KR20=.83 

n=100 EX/n=.75           50%      .052    .052     .049     .053 
σ²(X/n)=.012             68%      .077    .077     .072     .078 
S²(i)=.020               95%      .151    .150     .141     .153 
KR20=.87 






TABLE 8 

Confidence Intervals for n = 35 

Obs.                     Binomial     Normal     Comp.-Bin. 
Score   x/n   Coeff.     L    U      L    U      L    U 

   7    .2     50%      15   27     15   25     15   26 
               68%      13   29     13   27     13   28 
               95%       8   37      6   34      ?    ? 

  14    .4     50%      33   47     35   45     33   47 
               68%      31   50     33   47     31   56 
               95%      24   58     26   54     24   57 

  21    .6     50%      53   67     55   65     53   67 
               68%      50   69     53   67     51   69 
               95%      42   76     46   74     43   76 

  28    .8     50%      73   85     75   85     74   85 
               68%      71   87     73   87     72   87 
               95%      63   92     66   94     65   92 

  35   1.0     50%      96  100     95  105     99  100 
               68%      95  100     93  107     98  100 
               95%      90  100     86  114     95  100 

Note: Decimal points on limits are omitted. 
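
The Normal columns can be approximated with a short sketch, under the assumption that the interval is the observed proportion plus or minus a normal quantile times a constant standard error of measurement, sqrt(σ²(X/n)(1 − KR20)). That formula and the function name are assumptions made for illustration. The inputs σ²(X/n) = .042 and KR20 = .87 are the values listed for the n = 35, EX/n = .5 test in Table 7, and it is likewise an assumption that this table is based on that test.

```python
from math import sqrt
from statistics import NormalDist


def normal_confidence_interval(x_over_n, var_obs, reliability, coverage):
    """Symmetric normal interval around the observed proportion correct.

    Assumption: the standard error of measurement is the same at every
    score level, sqrt(var_obs * (1 - reliability)).
    """
    sem = sqrt(var_obs * (1.0 - reliability))
    z = NormalDist().inv_cdf(0.5 + coverage / 2.0)
    return x_over_n - z * sem, x_over_n + z * sem


# Observed score 21 of 35 (x/n = .6) at the three tabled coefficients.
for coverage in (0.50, 0.68, 0.95):
    low, high = normal_confidence_interval(0.6, 0.042, 0.87, coverage)
    print(f"{coverage:.0%}: {low:.3f} to {high:.3f}")
```

Because the assumed standard error does not depend on the score, the limits can fall outside the 0 to 1 range near a perfect score, which is consistent with the tabled Normal upper limits above 100 for an observed score of 35.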





TABLE 9 

Confidence Intervals for n = 100 

Obs.                     Binomial     Normal     Comp.-Bin. 
Score   x/n   Coeff.     L    U      L    U      L    U 

  20    .2     50%      17   23     17   23     17   23 
               68%      16   25     16   24     16   24 
               95%      13   29     13   28     13   28 

  40    .4     50%      36   44     37   43     36   44 
               68%      35   46     36   44     35   45 
               95%      30   50     32   48     31   50 

  60    .6     50%      56   64     57   63     56   64 
               68%       ?   65     56   64     55   65 
               95%      50   70     52   68     50   69 

  80    .8     50%      77   83     77   83     77   83 
               68%      75   84     76   84     76   84 
               95%      71   87     72   88     72   87 

 100   1.0     50%      99  100     97  103    100  100 
               68%      98  100     96  104     99  100 
               95%      96  100     92  108     98  100 

Note: Decimal points on limits are omitted. 
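
For the Binomial columns, one way to obtain limits of this kind is the exact equal-tailed (Clopper-Pearson) construction, sketched below in plain Python. Treating the tabled limits as Clopper-Pearson limits is an assumption, and the helper names are hypothetical; the sketch simply inverts the binomial distribution function by bisection.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k + 1))


def solve_monotone(f, target, increasing):
    """Bisection on [0, 1] for f(p) = target, where f is monotone in p."""
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if (f(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


def clopper_pearson(x, n, coverage):
    """Equal-tailed exact binomial limits for the true proportion."""
    tail = (1.0 - coverage) / 2.0
    # Lower limit: P(X >= x | p) = tail, which increases with p.
    lower = 0.0 if x == 0 else solve_monotone(
        lambda p: 1.0 - binom_cdf(x - 1, n, p), tail, True)
    # Upper limit: P(X <= x | p) = tail, which decreases with p.
    upper = 1.0 if x == n else solve_monotone(
        lambda p: binom_cdf(x, n, p), tail, False)
    return lower, upper


for score in (60, 100):
    low, high = clopper_pearson(score, 100, 0.95)
    print(score, round(low, 3), round(high, 3))
```

For a perfect score the lower limit has the closed form (tail)^(1/n), about .964 for n = 100 at the 95% coefficient, which agrees with the tabled lower limit of 96; the upper limit is 1 by construction.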



